Abstract

This paper investigates graph clustering in the planted cluster model in the presence of small clusters.
Traditional results dictate that for an algorithm to provably recover the clusters correctly, all clusters
must be sufficiently large (in particular, $\Omega(\sqrt{n})$, where $n$ is the number of nodes of the graph). We show
that this is not really a restriction: by a more refined analysis of the trace-norm based matrix recovery
approach proposed in Jalali et al. [2011] and Chen et al. [2012], we prove that small clusters, under
certain mild assumptions, do not hinder recovery of large ones. Based on this result, we further devise
an iterative algorithm to recover almost all clusters via a "peeling strategy", i.e., recover large clusters
first, leading to a reduced problem, and repeat this procedure. These results are extended to the partial
observation setting, in which only a (chosen) part of the graph is observed. The peeling strategy gives
rise to an active learning algorithm, in which edges adjacent to smaller clusters are queried more often
as large clusters are learned (and removed).

From a high level, this paper offers novel insights into high-dimensional statistics and learning structured
data, by presenting a structured matrix learning problem for which a one-shot convex relaxation
approach necessarily fails, but a carefully constructed sequence of convex relaxations does the job.



1 Introduction 



This paper considers a classic problem in machine learning and theoretical computer science, namely graph
clustering, i.e., given an undirected unweighted graph, partition the nodes into disjoint clusters, so that the
density of edges within one cluster is higher than that across clusters. Graph clustering arises naturally in
many applications across science and engineering. Some prominent examples include community detection in
social networks [Mishra et al., 2007], submarket identification in E-commerce and sponsored search [Yahoo!-Inc,
2009], and co-authorship analysis in document databases [Ester et al., 1995], among others. From
a purely binary classification theoretical point of view, the edges of the graph are (noisy) labels of similarity
or affinity between pairs of objects, and the concept class consists of clusterings of the objects (encoded
graphically by identifying clusters with cliques).



Many theoretical results in graph clustering [e.g., Boppana, 1987, Chen et al., 2012, McSherry, 2001]
consider the planted partition model, in which the edges are generated randomly; see Section 1.1 for more
details. While numerous different methods have been proposed, their performance guarantees all share the
following form: under certain conditions on the density of edges (within clusters and across clusters), the
proposed method succeeds in recovering the correct clusters exactly if all clusters are larger than a threshold
size, typically $\Omega(\sqrt{n})$.
In this paper, we aim to break this small cluster barrier of graph clustering. Correctly identifying
extremely small clusters is inherently hard, as they are easily confused with "fake" clusters generated by
noisy edges,^1 and is not the focus of this paper. Instead, in this paper we investigate a question that has not
been addressed before: can we still recover large clusters in the presence of small clusters? Intuitively, this
should be doable. To illustrate, consider an extreme example where the given graph $G$ consists of two disjoint
subgraphs $G_1$ and $G_2$, where $G_1$ is a graph that can be correctly clustered using some existing method, and
$G_2$ is a small-size clique. $G$ certainly violates the minimum cluster size requirement of previous results, but
why should $G_2$ spoil our ability to cluster $G_1$?



Our main result confirms this intuition. We show that the cluster size barrier arising in previous work
[e.g., Chaudhuri et al., 2012, Bollobas and Scott, 2004, Chen et al., 2012, McSherry, 2001] is not
really a restriction, but rather an artifact of the attempt to solve the problem in a single shot using convex
relaxation techniques. Using a more careful analysis, we prove that the mixed trace-norm and $\ell_1$ based
convex formulation, initially proposed in Jalali et al. [2011], can recover clusters of size $\tilde{\Omega}(\sqrt{n})$ even in the
presence of smaller clusters. That is, small clusters do not interfere with recovery of the big clusters.

The main implication of this result is that one can apply an iterative "peeling" strategy, recovering
smaller and smaller clusters. The intuition is simple: suppose the number of clusters is limited; then either
all clusters are large, or the sizes of the clusters vary significantly. The first case is obviously easy. The second
one is equally easy: using the aforementioned convex formulation, the larger clusters can be correctly identified.
If we remove all nodes from these larger clusters, the remaining subgraph contains significantly fewer nodes
than the original graph, which leads to a much lower threshold on the size of the cluster for correct recovery,
making it possible to correctly cluster some smaller clusters. By repeating this procedure, we
can recover the cluster structure for almost all nodes with no lower bound on the minimal cluster size. We
summarize our main contributions and techniques:

(1) We provide a refined analysis of the mixed trace norm and $\ell_1$ convex relaxation approach for exact
recovery of clusters proposed in Jalali et al. [2011] and Chen et al. [2012], focusing on the case where small
clusters exist. We show that in the classical planted partition setting [Boppana, 1987], if each cluster is
either large (more precisely, of size at least $x \approx \sqrt{n}\log^2 n$) or small (of size at most $x/\log^2 n$), then with high
probability, this convex relaxation approach correctly identifies all big clusters while "ignoring" the small
ones. Notice that the multiplicative gap between the two thresholds is polylogarithmic in $n$. In addition,
it is possible to arbitrarily increase $x$, thus turning a "knob" in quest of an interval $(x/\log^2 n, x)$ that is
disjoint from the set of cluster sizes. The analysis is done by identifying a certain feasible solution to the
convex program and proving its almost sure optimality using a careful construction of a dual certificate. This
feasible solution easily identifies the big clusters. This method has been performed before only in the case
where all clusters are of size $\ge x$.

(2) We provide a converse of the result just described. More precisely, we show that if for some value
of the knob $x$ an optimal solution appears to look as if the interval $(x/\log^2 n, x)$ were indeed free of cluster
sizes, then the solution is useful (in the sense that it correctly identifies big clusters) even if this were not the
case.

(3) The last two points imply that if some interval of the form $(x/\log^2 n, x)$ is free of cluster sizes,
then an exhaustive search of this interval will constructively find big clusters (though not necessarily for that
particular interval). This gives rise to an iterative algorithm, using a "peeling strategy", to recover smaller
and smaller clusters that are otherwise impossible to recover. Using the "knob", we prove that as long as the
number of clusters is bounded by $\Omega(\log n/\log\log n)$, regardless of the cluster sizes, we can correctly recover
the cluster structure for an overwhelming fraction of nodes. To the best of our knowledge, this is the first
result of provably correct graph clustering without any assumptions on the cluster sizes.

^1 Indeed, even in a more lenient setup where one clique (i.e., a perfect cluster) of size $K$ is embedded in an Erdos-Renyi
graph of $n$ nodes and 0.5 probability of forming an edge, to recover this clique, the best known polynomial method requires
$K = \Omega(\sqrt{n})$, and it has been a long standing open problem to relax this requirement.
(4) We extend the result to the partial observation case, where only a fraction of similarity labels (i.e.,
edge/no edge) is known. As expected, smaller observation rates allow identification of larger clusters. Hence,
the observation rate serves as the "knob". This gives rise to an active learning algorithm for graph clustering
based on adaptively increasing the rate of sampling in order to hit a "forbidden interval" free of cluster sizes,
and concentrating on smaller inputs as we identify big clusters and peel them off.

Besides these technical contributions, this paper provides novel insights into low-rank matrix recovery and
more generally high-dimensional statistics, where data are typically assumed to obey certain low-dimensional
structure. Numerous methods have been developed to exploit this a priori information so that a consistent
estimator is possible even when the dimensionality of data is larger than the number of samples. Our
result shows that one may combine these methods with a "peeling strategy" to further push the envelope of
learning structured data: by iteratively recovering the easier structure and then reducing the problem size,
it is possible to learn structures that are otherwise difficult using previous approaches.



1.1 Previous work 



The literature on graph clustering is too vast for a detailed survey here; we concentrate on the most related
work, and specifically on those works that provide theoretical guarantees on cluster recovery.



Planted partition model: The setup we study is the classical planted partition model [Boppana, 1987],
also known as the stochastic block model [Rohe et al., 2011]. Here, $n$ nodes are partitioned into
subsets, referred to as the "true clusters", and a graph is randomly generated as follows: for each pair of nodes,
depending on whether they belong to the same subset, an edge connecting them is generated with probability
$p$ or $q$ respectively. The goal is to correctly recover the clusters given the random graph. The planted partition
model has been studied since the 1980s [Boppana, 1987]. Earlier work focused on the 2-partition or more
generally $l$-partition case, i.e., the minimal cluster size is $\Theta(n)$ [Boppana, 1987, Condon and Karp, 2001,
Carson and Impagliazzo, 2001, Bollobas and Scott, 2004]. Recently, several works have proposed methods
to handle sublinear cluster sizes. These works can be roughly classified into three approaches: randomized
algorithms [e.g., Shamir and Tsur, 2007], spectral clustering [e.g., McSherry, 2001, Giesen and Mitsche, 2005,
Chaudhuri et al., 2012, Rohe et al., 2011], and low-rank matrix decomposition [Jalali et al., 2011, Chen et al.,
2012, Ames and Vavasis, 2011, Oymak and Hassibi, 2011]. While these works differ in methodology,
they all impose constraints on the size of the minimum true cluster: the best result to date requires it
to be $\tilde{\Omega}(\sqrt{n})$.



Correlation Clustering: This problem, originally defined by Bansal, Blum and Chawla [Bansal et al., 2004],
also considers graph clustering but in an adversarial noise setting. The goal there is to find the
clustering minimizing the total disagreement (intercluster edges plus intracluster nonedges), without there
being necessarily a notion of true clustering (and hence no "exact recovery"). This problem is usually studied
in the combinatorial optimization framework and is known to be NP-Hard to approximate to within some
constant factor. Prominent work includes Demaine et al. [2006], Ailon et al. [2008], and Charikar et al. [2005].
A PTAS is known in case the number of clusters is fixed [Giotis and Guruswami, 2006].



Low rank matrix decomposition via trace norm: Motivated by robust PCA, it has recently
been shown [Chandrasekaran et al., 2011, Candes et al., 2011] that it is possible to recover a low-rank matrix
from sparse errors of arbitrary magnitude, where the key ingredient is using the trace norm (aka nuclear norm)
as a convex surrogate of the rank. A similar result is also obtained when the low-rank matrix is corrupted
by other types of noise [Xu et al., 2012].

Of particular relevance to this paper are Jalali et al. [2011] and Chen et al. [2012], where the authors
apply this approach to graph clustering, and specifically to the planted partition model. Indeed, Chen et al.
[2012] achieve state-of-the-art performance guarantees for the planted partition problem. However, they do not
overcome the $\tilde{\Omega}(\sqrt{n})$ minimal cluster size lower bound.
Active learning/Active clustering: Another line of work that motivates this paper is the study of active
learning algorithms (a setting in which labeled instances are chosen by the learner, rather than by nature),
and in particular active learning for clustering. The most related work is Ailon et al. [2012], who investigated
active learning for correlation clustering. The authors obtain a $(1+\varepsilon)$-approximate solution with respect to
the optimal, while (actively) querying no more than $O(n\,\mathrm{poly}(\log n, k, \varepsilon^{-1}))$ edges. The result imposed no
restriction on cluster sizes and hence inspired this work, but differs in at least two major ways. First, Ailon
et al. [2012] did not consider exact recovery as we do. Second, their guarantees fall in the ERM (Empirical
Risk Minimization) framework, with no running time guarantees. Our work recovers the true clusters exactly
using a convex relaxation algorithm, and is hence computationally efficient. The problem of active learning
has also been investigated in other clustering setups, including clustering based on a distance matrix [Voevodski
et al., 2012, Shamir and Tishby, 2011] and hierarchical clustering [Eriksson et al., 2011, Krishnamurthy
et al., 2012]. These setups differ from ours and cannot be easily compared.



2 Notation and Setup 

Throughout, $V$ denotes a ground set of elements, which we identify with the set $[n] = \{1, \dots, n\}$. We assume
a true ground truth clustering of $V$ given by a pairwise disjoint covering $V_1, \dots, V_k$, where $k$ is the number
of clusters. We say $i \sim j$ if $i, j \in V_a$ for some $a \in [k]$, otherwise $i \not\sim j$. We let $n_i = |V_i|$ for all $i \in [k]$. For
any $i \in [n]$, $\langle i \rangle$ is the unique index satisfying $i \in V_{\langle i \rangle}$.

For a matrix $X \in \mathbb{R}^{n \times n}$ and a subset $S \subseteq [n]$ of size $m$, the matrix $X[S] \in \mathbb{R}^{m \times m}$ is the principal minor
of $X$ corresponding to the set of indexes $S$. For a matrix $M$, $\Gamma(M)$ denotes the support of $M$, namely, the
set of index pairs $(i,j)$ such that $M(i,j) \neq 0$.

The ground truth clustering matrix, denoted $K^*$, is defined so that $K^*(i,j) = 1$ if $i \sim j$, otherwise 0.
This is a block diagonal matrix, each block consisting of 1's only. Its rank is $k$. The input is a symmetric
matrix $A$, a noisy version of $K^*$. It is generated using the well known planted clustering model, as follows.
There are two fixed edge probabilities, $p > q$. We think of $A$ as the adjacency matrix of an undirected random
graph, where edge $(i,j)$ is in the graph for $i > j$ with probability $p$ if $i \sim j$, otherwise with probability $q$,
independent of other choices. The error matrix is denoted by $B^* := A - K^*$. Its support $\Gamma(B^*)$ gives the
noise locations.

Note that our results apply to the more practical case in which the edge probability of $(i,j)$ is $p_{ij}$ for
each $i \sim j$ and $q_{ij}$ for $i \not\sim j$, as long as $(\min p_{ij}) =: p > q := (\max q_{ij})$.
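The generative model above can be sketched in a few lines. The following is a minimal illustration, not the authors' code: the function name `planted_partition`, the `seed` argument, and the choice to set the diagonal of $A$ to 1 (so that it agrees with $K^*$ on the diagonal) are assumptions made here for the example.

```python
import random

def planted_partition(sizes, p, q, seed=0):
    """Sample (A, K_star) for ground truth clusters of the given sizes.

    K_star[i][j] = 1 iff i and j share a cluster; A is a symmetric noisy copy
    in which an intra-cluster edge appears w.p. p and an inter-cluster edge w.p. q.
    """
    rng = random.Random(seed)
    # labels[i] is the index of the cluster containing node i
    labels = [c for c, size in enumerate(sizes) for _ in range(size)]
    n = len(labels)
    K_star = [[1 if labels[i] == labels[j] else 0 for j in range(n)]
              for i in range(n)]
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 1  # assumption: the diagonal is noiseless
        for j in range(i):  # one independent coin per unordered pair
            prob = p if labels[i] == labels[j] else q
            edge = 1 if rng.random() < prob else 0
            A[i][j] = A[j][i] = edge
    return A, K_star
```

With $p = 1$ and $q = 0$ the sample is noiseless and $A = K^*$; for intermediate values the error matrix $B^* = A - K^*$ is supported on the flipped pairs.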

3 Results 

We remind the reader that the trace norm of a matrix is the sum of its singular values, and we define the
$\ell_1$ norm of a matrix $M$ to be $\|M\|_1 = \sum_{i,j} |M(i,j)|$. Consider the following convex program, combining the
trace norm of a matrix variable $K$ with the $\ell_1$ norm of another matrix variable $B$ using two parameters $c_1, c_2$
that will be determined later:

(CP1)   $\min \ \|K\|_* + c_1 \|\mathcal{P}_{\Gamma(A)} B\|_1 + c_2 \|\mathcal{P}_{\Gamma(A)^c} B\|_1$
        s.t. $K + B = A$,
             $0 \le K_{ij} \le 1, \ \forall (i,j)$.

Theorem 1. There exist constants $b_1, b_3, b_4 > 0$ such that the following holds with probability at least $1 - n^{-3}$.
For any parameter $\kappa \ge 1$ and $t \in [\frac{1}{4}p + \frac{3}{4}q, \frac{3}{4}p + \frac{1}{4}q]$, define

$\ell_\sharp = b_3 \frac{\kappa\sqrt{p(1-q)n}}{p-q} \log^2 n, \qquad \ell_\flat = b_4 \frac{\kappa\sqrt{p(1-q)n}}{p-q}.$   (1)

If for all $i \in [k]$, either $n_i \ge \ell_\sharp$ or $n_i \le \ell_\flat$, and if $(\hat{K}, \hat{B})$ is an optimal solution to (CP1), with

$c_1 = \frac{b_1}{\kappa\sqrt{n \log n}} \sqrt{\frac{1-t}{t}}, \qquad c_2 = \frac{b_1}{\kappa\sqrt{n \log n}} \sqrt{\frac{t}{1-t}},$   (2)

then $(\hat{K}, \hat{B}) = (\mathcal{P}_\sharp K^*, A - \hat{K})$, where for a matrix $M$, $\mathcal{P}_\sharp M$ is the matrix defined by

$(\mathcal{P}_\sharp M)(i,j) = M(i,j)$ if $\max\{n_{\langle i \rangle}, n_{\langle j \rangle}\} \ge \ell_\sharp$, and $0$ otherwise.

(Note that by the theorem's premise, $\hat{K}$ is the matrix obtained from $K^*$ after zeroing out blocks
corresponding to clusters of size at most $\ell_\flat$.) The proof is based on Chen et al. [2012] and is deferred to the
supplemental material due to lack of space. The main novelty in this work compared to previous work is the
treatment of small clusters of size at most $\ell_\flat$, whereas in previous work only large clusters were treated, and
the existence of small clusters did not allow recovery of the big clusters.

Figure 1: A partial clustering matrix $K$. Black represents 1, white represents 0. $\sigma_{\min}(K)$ is the
side length of the smallest black square.
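To make the role of the knob $\kappa$ concrete, the sketch below evaluates the thresholds and weights of Theorem 1, assuming Eqs. (1) and (2) read $\ell_\sharp = b_3 \kappa \sqrt{p(1-q)n} \log^2 n/(p-q)$, $\ell_\flat = b_4 \kappa \sqrt{p(1-q)n}/(p-q)$, $c_1 = \frac{b_1}{\kappa\sqrt{n\log n}}\sqrt{(1-t)/t}$ and $c_2 = \frac{b_1}{\kappa\sqrt{n\log n}}\sqrt{t/(1-t)}$. The paper does not give numerical values for $b_1, b_3, b_4$; the defaults of 1.0 below are placeholders, and the function name is hypothetical.

```python
import math

def cp1_parameters(n, p, q, kappa=1.0, t=None, b1=1.0, b3=1.0, b4=1.0):
    """Return (ell_sharp, ell_flat, c1, c2) for the assumed form of Eqs. (1)-(2).

    b1, b3, b4 are placeholder constants, not values taken from the paper.
    """
    if t is None:
        t = (p + q) / 2  # the midpoint always lies in the admissible range
    if not (p > q and kappa >= 1):
        raise ValueError("need p > q and kappa >= 1")
    if not (0.25 * p + 0.75 * q <= t <= 0.75 * p + 0.25 * q):
        raise ValueError("t outside [p/4 + 3q/4, 3p/4 + q/4]")
    base = kappa * math.sqrt(p * (1 - q) * n) / (p - q)
    ell_sharp = b3 * base * math.log(n) ** 2  # lower bound for "large" clusters
    ell_flat = b4 * base                      # upper bound for "small" clusters
    scale = b1 / (kappa * math.sqrt(n * math.log(n)))
    c1 = scale * math.sqrt((1 - t) / t)
    c2 = scale * math.sqrt(t / (1 - t))
    return ell_sharp, ell_flat, c1, c2
```

Increasing `kappa` scales both thresholds up by the same factor, so the forbidden interval $(\ell_\flat, \ell_\sharp)$ slides along the size axis while its multiplicative width stays at $(b_3/b_4)\log^2 n$.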

Definition 2. An $n \times n$ matrix $K$ is a partial clustering matrix if there exists a collection of pairwise disjoint
sets $U_1, \dots, U_r \subseteq [n]$ (the induced clusters) such that $K(i,j) = 1$ if and only if $i, j \in U_s$ for some $s \in [r]$,
otherwise 0. If $K$ is a partial clustering matrix then $\sigma_{\min}(K)$ is defined as $\min_{s \in [r]} |U_s|$.
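Definition 2 translates directly into a check that an algorithm can run on a candidate solution. The sketch below is an illustration under the assumption that $K$ is given as a list of lists with exact 0/1 entries (a numerical solver's output would first need rounding); the function names are hypothetical.

```python
def induced_clusters(K):
    """Return the induced clusters U_1..U_r if K is a partial clustering
    matrix in the sense of Definition 2, otherwise None."""
    n = len(K)
    clusters = []
    label = {}  # node -> index of its induced cluster
    for i in range(n):
        if i in label or K[i][i] != 1:
            continue
        members = [j for j in range(n) if K[i][j] == 1]
        if any(j in label for j in members):
            return None  # overlaps an earlier cluster, so the sets are not disjoint
        for j in members:
            label[j] = len(clusters)
        clusters.append(members)
    # K must be exactly the indicator of "i and j lie in the same U_s"
    for i in range(n):
        for j in range(n):
            same = i in label and j in label and label[i] == label[j]
            if K[i][j] != (1 if same else 0):
                return None
    return clusters

def sigma_min(K):
    """Size of the smallest induced cluster; None if K is not a partial
    clustering matrix or induces no cluster at all."""
    clusters = induced_clusters(K)
    if not clusters:
        return None
    return min(len(U) for U in clusters)
```

Nodes whose row is identically zero belong to no induced cluster, which is exactly how the "ignored" small clusters show up in the solution.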

The definition is depicted in Figure 1. Theorem 1 tells us that by choosing $\kappa$ (and hence $c_1, c_2$) properly
such that no cluster size falls in the range $(\ell_\flat, \ell_\sharp)$, the unique optimal solution $(\hat{K}, \hat{B})$ to convex program
(CP1) is such that $\hat{K}$ is a partial clustering induced by big ground truth clusters.

In order for this fact to be useful algorithmically, we also need a type of converse: there exists an event
with high probability (in the random process generating the input), such that for all values of $\kappa$, if an optimal
solution to the corresponding (CP1) looks like the solution $(\hat{K}, \hat{B})$ defined in Theorem 1, then the blocks of
$\hat{K}$ correspond to actual clusters.

Theorem 3. There exist constants $C_1, C_2 > 0$ such that with probability at least $1 - n^{-3}$, the following
holds. For all $\kappa \ge 1$ and $t \in [\frac{3}{4}q + \frac{1}{4}p, \frac{1}{4}q + \frac{3}{4}p]$, if $(K, B)$ is an optimal solution to (CP1) with $c_1, c_2$ as
defined in Theorem 1, and additionally $K$ is a partial clustering induced by $U_1, \dots, U_r \subseteq V$, and also

$\sigma_{\min}(K) \ge \max\left\{ \frac{C_1 k \log n}{(p-q)^2}, \ \frac{C_2 \kappa \sqrt{p(1-q) n \log n}}{p-q} \right\},$   (3)

then $U_1, \dots, U_r$ are actual ground truth clusters, namely, there exists an injection $\phi : [r] \mapsto [k]$ such that
$U_i = V_{\phi(i)}$ for all $i \in [r]$.
(Note: Our proof of Theorem 3 uses Hoeffding tail bounds for simplicity, which are tight for $q$ bounded
away from 0 and 1. Bernstein tail bounds can be used to strengthen the result for other classes of $p, q$. We
elaborate on this in Section [O] ) 

The combination of Theorems 1 and 3 implies that, as long as there exists a relatively small interval
which is disjoint from the set of cluster sizes, and such that at least one cluster size is larger than this
interval (and large enough), we can recover at least one (large) cluster using (CP1). This is made clear in
the following.

Corollary 4. Assume we have a guarantee that there exists a number $\alpha \ge b_4 \frac{\sqrt{p(1-q)n}}{p-q}$, such that no cluster
size falls in the interval $(\alpha, \frac{b_3}{b_4}\alpha \log^2 n)$ and at least one cluster size is of size at least
$s := \max\{\frac{b_3}{b_4}\alpha \log^2 n, \ (C_1 k \log n)/(p-q)^2, \ C_2\sqrt{p(1-q) n \log n}/(p-q)\}$. Then with probability at least $1 - n^{-3}$, we can recover at least one cluster
of size at least $s$ efficiently by solving (CP1) with $\kappa = \alpha \Big/ \left( b_4 \frac{\sqrt{p(1-q)n}}{p-q} \right)$.




Of course we do not know what $\alpha$ (and hence $\kappa$) is. We could exhaustively search for a $\kappa \ge 1$ and
hope to recover at least one large cluster. A more interesting question is, when is such a $\kappa$ guaranteed
to exist? Let $g = \frac{b_3}{b_4}\log^2 n$. The number $g$ is the (multiplicative) gap size, equaling the ratio between
$\ell_\sharp$ and $\ell_\flat$ (for any $\kappa$). If the number of clusters $k$ is a priori bounded by some $k_0$, we both ensure that
there is at least one cluster of size $n/k_0$, and by the pigeonhole principle, that one of the intervals in the
sequence $(n/(g k_0), n/k_0), (n/(g^2 k_0), n/(g k_0)), \dots, (n/(g^{k_0+1} k_0), n/(g^{k_0} k_0))$ is disjoint from the set of cluster sizes. If, in addition,
the smallest interval in the sequence is not too small and $n/k_0$ is not too small so that Corollary 4 holds,
then we are guaranteed to recover at least one cluster using Algorithm 1. We find this condition difficult to
work with. An elegant, useful version of the idea is obtained if we assume $p, q$ are some fixed constants.^2 As
the following lemma shows, it turns out that in this regime, $k_0$ can be assumed to be almost logarithmic in
$n$ to ensure recovery of at least one cluster. In what follows, notation such as $C(p,q), C_3(p,q), \dots$ denotes
universal positive functions that depend on $p, q$ only.
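The pigeonhole argument above can be checked mechanically on a toy instance. The following sketch scans the $k_0 + 1$ intervals $(n/(g^{s+1} k_0), n/(g^s k_0))$ for $s = 0, \dots, k_0$ and returns the first one containing no cluster size. It is only an illustration: the function name is hypothetical, and in the paper $g = (b_3/b_4)\log^2 n$ depends on unspecified constants, so a small numeric `g` is used here purely as an example.

```python
def find_free_interval(sizes, n, k0, g):
    """Return (s, low, high) for the first open interval
    (n / (g**(s+1) * k0), n / (g**s * k0)), s = 0..k0, that contains no
    cluster size, or None if every interval is hit."""
    for s in range(k0 + 1):
        high = n / (g ** s * k0)
        low = high / g
        if not any(low < size < high for size in sizes):
            return s, low, high
    return None
```

For the cluster sizes of Experiment 1 (800, 200, 80, 20 with $n = 1100$), taking $k_0 = 4$ and an illustrative $g = 3$, the intervals for $s = 0, 1, 2$ contain 200, 80 and 20 respectively, and the interval for $s = 3$ is free. When at most $k_0$ clusters exist, one of them has size at least $n/k_0$ and lies above every interval, so at most $k_0 - 1$ of the $k_0 + 1$ disjoint intervals can be occupied and the search cannot fail.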

Lemma 5. There exist $C_3(p,q), C_4(p,q), C_5 > 0$ such that the following holds. Assume that $n > C_4(p,q)$,
and that we are guaranteed that $k \le k_0$, where $k_0 = \frac{C_3(p,q)\log n}{\log\log n}$. Then with probability at least $1 - n^{-3}$
Algorithm 1 will recover at least one cluster in at most $C_5 k_0$ iterations.

The proof is deferred to the supplemental material. Lemma 5 ensures that by trying at most a
logarithmic number of values of $\kappa$, we can recover at least one large cluster, assuming the number of clusters
is roughly logarithmic in $n$. The next proposition tells us that as long as this step recovers clusters
covering at most all but a vanishing fraction of elements, the step can be repeated.

Proposition 6. A pair of numbers $(n', k')$ is called good if $n' \le n$, $k' \le k$ and $k' \le \frac{C_3(p,q)\log n'}{\log\log n}$. If $(n', k')$
is good, then $(n'', k'')$ is good for all $n'', k''$ satisfying $n' \ge n'' \ge n'/(\log n)^{1/C_3(p,q)}$ and $k' - 1 \ge k'' \ge 1$.



The proof is trivial. The proposition implies an inductive process in which at least one big (with respect
to the current unrecovered size) cluster can be efficiently removed as long as the previous step recovered at
most a $(1 - (\log n)^{-1/C_3(p,q)})$-fraction of its input. Combining, we proved the following:

Theorem 7. Assume $n, k$ satisfy the requirements of Lemma 5. Then with probability at least $1 - 2n^{-3}$
Algorithm 2 recovers clusters covering all but at most a $(\log n)^{-1/C_3(p,q)}$ fraction of the input in the full
observation case, without any restriction on the minimal cluster size. Moreover, if we assume that $k$ is
bounded by a constant $k_0$, then the algorithm will recover clusters covering all but a constant number of
input elements.



^2 In fact, we need only fix $(p - q)$, but we wish to keep this exposition simple.

^3 In comparison, Ailon et al. [2012] require $k_0$ to be constant for their guarantees, as does the Correlation Clustering PTAS
[Giotis and Guruswami, 2006].
3.1 Partial Observations



We now consider the case where the input matrix $A$ is not given to us in its entirety, but rather we have
oracle access to $A(i,j)$ for $(i,j)$ of our choice. Unobserved values are formally marked with $A(i,j) = ?$.

Consider a more particular setting in which the edge probabilities defining $A$ are $p'$ (for $i \sim j$) and $q'$
(for $i \not\sim j$), and we observe $A(i,j)$ with probability $\rho$, for each $i, j$, independently. More precisely: for $i \sim j$
we have $A(i,j) = 1$ with probability $\rho p'$, $0$ with probability $\rho(1 - p')$, and ? with the remaining probability. For
$i \not\sim j$ we have $A(i,j) = 1$ with probability $\rho q'$, $0$ with probability $\rho(1 - q')$, and ? with the remaining probability.
Clearly, by pretending that the values ? in $A$ are 0, we emulate the full observation case with $p = \rho p'$, $q = \rho q'$.
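The reduction from partial to full observation is easy to simulate. In the sketch below, `None` plays the role of the "?" marker, and `observe` and `zero_fill` are hypothetical helper names; keeping the diagonal always observed is an assumption made for the example rather than something the text specifies.

```python
import random

def observe(A_full, rho, seed=0):
    """Reveal each unordered pair {i, j} independently with probability rho.
    Unobserved entries are None (the '?' of the text)."""
    rng = random.Random(seed)
    n = len(A_full)
    A_obs = [[None] * n for _ in range(n)]
    for i in range(n):
        A_obs[i][i] = A_full[i][i]  # assumption: diagonal is always known
        for j in range(i):
            if rng.random() < rho:
                A_obs[i][j] = A_obs[j][i] = A_full[i][j]
    return A_obs

def zero_fill(A_obs):
    """Pretend every unobserved entry is 0."""
    return [[0 if x is None else x for x in row] for row in A_obs]
```

After `zero_fill`, an intra-cluster pair equals 1 exactly when it is both observed and an edge, which happens with probability $\rho p'$; likewise $\rho q'$ for inter-cluster pairs. The result is therefore distributed as a fully observed planted partition graph with $p = \rho p'$ and $q = \rho q'$.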

Of particular interest is the case in which $p', q'$ are held fixed and $\rho$ tends to zero as $n$ grows. In this
regime, by varying $\rho$ and fixing $\kappa = 1$, Theorem 1 implies the following:

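Corollary 8. There exist constants $b_1(p',q'), b_3(p',q'), b_4(p',q'), b_5(p',q') > 0$ such that for any sampling
rate parameter $\rho$ the following holds with probability at least $1 - n^{-3}$. Define

$\ell_\sharp = b_3(p',q') \frac{\sqrt{n}}{\sqrt{\rho}} \log^2 n, \qquad \ell_\flat = b_4(p',q') \frac{\sqrt{n}}{\sqrt{\rho}}.$

If for all $i \in [k]$, either $n_i \ge \ell_\sharp$ or $n_i \le \ell_\flat$, and if $(\hat{K}, \hat{B})$ is an optimal solution to (CP1), with

$c_1 = \frac{b_1(p',q')}{\sqrt{n \log n}} \sqrt{\frac{1 - b_5(p',q')\rho}{b_5(p',q')\rho}}, \qquad c_2 = \frac{b_1(p',q')}{\sqrt{n \log n}} \sqrt{\frac{b_5(p',q')\rho}{1 - b_5(p',q')\rho}},$

then $(\hat{K}, \hat{B}) = (\mathcal{P}_\sharp K^*, A - \hat{K})$, where $\mathcal{P}_\sharp$ is as defined in Theorem 1.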

(Note: We've abused notation by reusing previously defined global constants (e.g. $b_1$) with global
functions of $p', q'$ (e.g. $b_1(p', q')$).) Notice now that the observation probability $\rho$ can be used as a knob for
controlling the cluster sizes we are trying to recover, instead of $\kappa$. We would also like to obtain a version of
Theorem 3. In particular, we would like to understand its asymptotics as $\rho$ tends to 0.



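Algorithm 1 RecoverBigFullObs($V, A, p, q$)

require: ground set $V$, $A \in \mathbb{R}^{V \times V}$, probs $p, q$
    $n \leftarrow |V|$
    $t \leftarrow \frac{p+q}{2}$ (or anything in $[\frac{1}{4}p + \frac{3}{4}q, \frac{3}{4}p + \frac{1}{4}q]$)
    $\ell_\sharp \leftarrow n$, $g \leftarrow \frac{b_3}{b_4}\log^2 n$
    // (If have prior bound $k_0$ on num clusters,
    // take $\ell_\sharp \leftarrow n/k_0$)
    while $\ell_\sharp \ge \max\left\{ \frac{C_1 k \log n}{(p-q)^2}, \frac{C_2\sqrt{p(1-q) n \log n}}{p-q} \right\}$ do
        solve for $\kappa$ using (1), set $c_1, c_2$ as in (2)
        $(K, B) \leftarrow$ optimal solution to (CP1) with $c_1, c_2$
        if $K$ partial clustering matrix with $\sigma_{\min}(K) \ge \ell_\sharp$ then
            return induced clusters $\{U_1, \dots, U_r\}$ of $K$
        end if
        $\ell_\sharp \leftarrow \ell_\sharp / g$
    end while
    return $\emptyset$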
Algorithm 2 RecoverFullObs($V, A, p, q$)

require: ground set $V$, matrix $A \in \mathbb{R}^{V \times V}$, probs $p, q$
    $\{U_1, \dots, U_r\} \leftarrow$ RecoverBigFullObs($V, A, p, q$)
    $V' \leftarrow [n] \setminus (U_1 \cup \dots \cup U_r)$
    if $r = 0$ then
        return $\emptyset$
    else
        return RecoverFullObs($V', A[V'], p, q$) $\cup \{U_1, \dots, U_r\}$
    end if
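The control flow of the peeling recursion can be sketched independently of the convex solver. Below, `recover_all` is an iterative rendering of Algorithm 2 that takes the "recover big clusters" step as a callback. The callback `toy_oracle` is a stand-in invented for this example: it peeks at the ground truth and returns every true cluster of size at least $\sqrt{n'}$ on the current $n'$ remaining nodes, which mimics (but does not implement) the behaviour that Theorems 1 and 3 establish for (CP1).

```python
import math

def recover_all(nodes, recover_big):
    """Peeling loop: repeatedly recover big clusters and remove their nodes.
    Returns (recovered clusters in order, nodes left unrecovered)."""
    remaining = list(nodes)
    recovered = []
    while remaining:
        clusters = recover_big(remaining)
        if not clusters:  # r = 0: nothing more can be peeled
            break
        recovered.extend(clusters)
        removed = set().union(*clusters)
        remaining = [v for v in remaining if v not in removed]
    return recovered, remaining

def toy_oracle(truth):
    """Hypothetical stand-in for RecoverBigFullObs that uses the ground truth:
    returns the true clusters of size >= sqrt(len(remaining))."""
    def recover_big(remaining):
        alive = set(remaining)
        threshold = math.sqrt(len(remaining))
        parts = [[v for v in cluster if v in alive] for cluster in truth]
        return [part for part in parts if part and len(part) >= threshold]
    return recover_big

# Ground truth with sizes 16, 4, 1 on 21 nodes.
truth = [list(range(0, 16)), list(range(16, 20)), [20]]
recovered, left = recover_all(range(21), toy_oracle(truth))
```

In this toy run the threshold drops from $\sqrt{21}$ to $\sqrt{5}$ to $1$ as nodes are removed, so the clusters of size 16, 4 and 1 are peeled in three successive rounds, although the last two are far below the initial threshold.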



Theorem 9. There exist constants $C_1(p',q'), C_2(p',q') > 0$ such that for all observation rate parameters
$\rho \le 1$, the following holds with probability at least $1 - n^{-3}$. If $(K, B)$ is an optimal solution to (CP1) with
$c_1, c_2$ as defined in Theorem 1, and additionally $K$ is a partial clustering induced by $U_1, \dots, U_r \subseteq V$, and
also

$\sigma_{\min}(K) \ge \max\left\{ \frac{C_1(p',q')\, k \log n}{\rho}, \ \frac{C_2(p',q')\sqrt{n \log n}}{\sqrt{\rho}} \right\},$   (4)

then $U_1, \dots, U_r$ are actual ground truth clusters, namely, there exists an injection $\phi : [r] \mapsto [k]$ such that
$U_i = V_{\phi(i)}$ for all $i \in [r]$.

The proof can be found in the supplemental material. Using the same reasoning as before, we derive 
the following: 

Theorem 10. Let $g = (b_3(p',q')/b_4(p',q')) \log^2 n$ (with $b_3(p',q'), b_4(p',q')$ defined in Corollary 8). There
exists a constant $C_3(p',q')$ such that the following holds. Assume the number of clusters $k$ is bounded by
some known number $k_0 \le C_3(p',q')(\log n)/(\log\log n)$. Let $\rho_0 = \frac{b_3(p',q')^2 k_0^2 \log^4 n}{n}$. Then there exists $\rho$ in the
set $\{\rho_0, \rho_0 g^2, \dots, \rho_0 g^{2k_0}\}$ for which, if $A$ is obtained with observation rate $\rho$ (zeroing ?'s), then with probability
at least $1 - n^{-3}$, any optimal solution $(K, B)$ to (CP1) with $c_1, c_2$ from Corollary 8 satisfies (4).

(Note that the upper bound on $k_0$ ensures that $\rho_0 g^{2k_0}$ is a probability.) The theorem is proven using a
simple pigeonhole principle, noting that one of the intervals $(\ell_\flat(\rho), \ell_\sharp(\rho))$ must be disjoint from the set of
cluster sizes, and there is at least one cluster of size at least $n/k_0$. The theorem, together with Corollary 8
and Theorem 9, ensures the following. On one end of the spectrum, if $k_0$ is a constant (and $n$ is large enough),
then with high probability we can recover at least one large cluster (of size at least $n/k_0$) after querying no
more than

$O\left( n k_0^2 (\log n)^4 \left( \frac{b_3(p',q')}{b_4(p',q')} \log^2 n \right)^{2k_0} \right)$   (5)

values of $A(i,j)$. On the other end of the spectrum, if $k_0 \le \delta(\log n)/(\log\log n)$ and $n$ is large enough
(exponential in $1/\delta$), then we can recover at least one large cluster after querying no more than $n^{1+O(\delta)}$
values of $A(i,j)$. (We omit the details of the last fact from this version.) This is summarized in the
following:
following: 

Theorem 11. Assume an upper bound $k_0$ on the number of clusters $k$. As long as $n$ is larger than some
function of $k_0, p', q'$, Algorithm 4 will recover, with high probability, at least one cluster of size
at least $n/k_0$, regardless of the size of other (small) clusters. Moreover, if $k_0$ is a constant, then clusters
covering all but a constant number of elements will be recovered with high probability, and the
total number of observation queries is hence almost linear.

Note that unlike previous results for this problem, the recovery guarantee does not impose any lower
bound on the size of the smallest cluster. Also note that the underlying algorithm is an active learning one,
because more observations fall in smaller clusters which survive deeper in the recursion of Algorithm 4.
4 Experiments



We experimented with simplified versions of our algorithms. Here we did not make an effort to compute the
various constants defining the algorithms in this work, creating a difficulty in exact implementation. Instead,
for Algorithm 1 we increase $\kappa$ by a multiplicative factor of 1.1 in each iteration until a partial clustering
matrix is found. Similarly, in Algorithm 3, $\rho$ is increased by an additive factor of 0.025. Still, our
experiments support our theoretical findings. A more practical "user's guide" for this method with
actual constants is subject to future work.



In all experiment reports below, we use a variant of the Augmented Lagrangian Multiplier (ALM)
method [Lin et al., 2009] to solve the semi-definite program (CP1). Whenever we say that "clusters $\{V_{i_1}, V_{i_2}, \dots\}$
were recovered", we mean that a corresponding instantiation of (CP1) resulted in an optimal solution $(K, B)$
for which $K$ was a partial clustering matrix induced by $\{V_{i_1}, V_{i_2}, \dots\}$.



Experiment 1 (Full Observation) Consider $n = 1100$ nodes partitioned into 4 clusters $V_1, \dots, V_4$, of
sizes 800, 200, 80, 20, respectively. The graph is generated according to the planted partition model with
$p = 0.5$, $q = 0.2$, and we assume the full observation setting. We apply a simplified version of Algorithm 2,
which terminates in 4 steps. The recovered clusters at each step are detailed in Table 1.



Experiment 2 (Partial Observation - Fixed Sample Rate) We have $n = 1100$ with clusters $V_1, \dots, V_4$
of sizes 800, 200, 50, 50. The observed graph is generated with $p' = 0.7$, $q' = 0.1$, and observation rate



Algorithm 3 RecoverBigPartialObs($V, k_0$) (Assume $p', q'$ known, fixed)

require: ground set $V$, oracle access to $A \in \mathbb{R}^{V \times V}$, upper bound $k_0$ on number of clusters
    $n \leftarrow |V|$
    $\rho_0 \leftarrow \frac{b_3(p',q')^2 k_0^2 \log^4 n}{n}$
    $g \leftarrow (b_3(p',q')/b_4(p',q')) \log^2 n$
    for $s \in \{0, \dots, k_0\}$ do
        $\rho \leftarrow \rho_0 g^{2s}$
        obtain matrix $A \in \{0, 1, ?\}^{V \times V}$ by sampling oracle at rate $\rho$, then zero ? values in $A$
        // (can reuse observations from prev. iterations)
        $c_1(p',q'), c_2(p',q') \leftarrow$ as in Corollary 8
        $(K, B) \leftarrow$ an optimal solution to (CP1)
        if $K$ partial clustering matrix satisfying (4) then
            return induced clusters $\{U_1, \dots, U_r\}$
        end if
    end for
    return $\emptyset$



Algorithm 4 RecoverPartialObs($V, k_0$) (Assume $p', q'$ known, fixed)

require: ground set $V$, oracle access to $A \in \mathbb{R}^{V \times V}$, upper bound $k_0$ on number of clusters
    $\{U_1, \dots, U_r\} \leftarrow$ RecoverBigPartialObs($V, k_0$)
    $V' \leftarrow [n] \setminus (U_1 \cup \dots \cup U_r)$
    if $r = 0$ then
        return $\emptyset$
    else
        return RecoverPartialObs($V', k_0 - r$) $\cup \{U_1, \dots, U_r\}$
    end if
Experiment 1:

    Step   $\kappa$   # nodes left   Clusters recovered
    1      1          1100           $V_1$
    2      1          300            $V_2$
    3      1          100            $V_3$
    4      1          20             $V_4$

Experiment 2:

    Step   $\kappa$   # nodes left   Clusters recovered
    1      1          1100           $V_1$
    2      1          300            $V_2$
    3      1          100            $V_3, V_4$

Experiment 3:

    Step   $\rho$     # nodes left   Clusters recovered
    1      0.2        1100           $V_1$
    2      0.4        300            $V_2$
    3      0.95       100            $V_3, V_4$

Experiment 3A:

    Step   $\rho$     # nodes left   Clusters recovered
    1      0.15       4500           $V_1$
    2      0.175      1300           $V_2$
    3      0.2        500            $V_3, V_4$
    4      0.475      100            $V_5, V_6$

Table 1: Experiment Results



$\rho = 0.3$. We repeatedly solved (CP1) with $c_1, c_2$ as in Corollary 8. At each iteration, at least one large
cluster (compared to the input size at that iteration) was recovered exactly and removed. This terminated
in exactly 3 iterations. Results are shown in Table 1.



Experiment 3 (Partial Observation - Incremental Sampling Rate) We tried a simplified version
of Algorithm 4. We have $n = 1100$ with clusters $V_1, \dots, V_4$ of sizes 800, 200, 50, 50. The observed graph is
generated with $p' = 0.7$, $q' = 0.3$, and an observation rate $\rho$ which we now specify. We start with $\rho = 0$ and
increase it by 0.025 incrementally until we recover (and then remove) at least one cluster, then repeat. The
algorithm terminates in 3 steps. Results are shown in Table 1.
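
The incremental schedule described above can be sketched as a small search loop. Here `try_recover` is a hypothetical callback standing in for "sample at rate rho, solve the convex program, and return any clusters recovered"; whether the rate restarts from zero or continues from its previous value after a cluster is removed is left to the caller through the `start` argument, since the text does not pin this down.

```python
def incremental_rate_search(try_recover, start=0.0, step=0.025, max_rate=1.0):
    """Raise the sampling rate in fixed increments until something is recovered.

    try_recover(rho) is assumed to return a (possibly empty) list of clusters.
    Returns (rho, clusters), or (None, []) if nothing is found up to max_rate.
    """
    i = 0
    while True:
        # Rounding avoids accumulating floating-point drift in the schedule.
        rho = min(round(start + i * step, 10), max_rate)
        clusters = try_recover(rho)
        if clusters:
            return rho, clusters
        if rho >= max_rate:
            return None, []
        i += 1
```

In a full peeling run this search would be repeated after each removal, on the reduced node set.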



Experiment 3A We repeat the experiment with a larger instance: $n = 4500$ with clusters $V_1, \dots, V_6$ of
sizes 3200, 800, 200, 200, 50, 50, and $p' = 0.8$, $q' = 0.2$. Results are shown in Table 1. Note that we recover
the smallest clusters, whose size is below $\sqrt{n}$.



Experiment 4 (Mid-Size Clusters) Our current theoretical results do not say anything about the mid-size
clusters, those with sizes between $\ell_\flat$ and $\ell_\sharp$. It is interesting to study the behavior of (CP1) in the
presence of mid-size clusters. We generated an instance with $n = 750$, $(|V_1|, |V_2|, |V_3|, |V_4|) = (500, 150, 70, 30)$,
$p = 0.8$, $q = 0.2$, and $\rho = 0.12$. We then solved (CP1) with a fixed $\kappa = 1$. The low-rank part $\hat{K}$ of the
solution is shown in Fig. 2. The large cluster $V_1$ is completely recovered in $\hat{K}$, while the small clusters
$V_3$ and $V_4$ are entirely ignored. The mid-size cluster $V_2$, however, exhibits a pattern we find difficult to
characterize. This suggests that the polylog gap in our theorems is a real phenomenon and not an artifact of
our proof technique. Nevertheless, the large cluster appears clean, and might allow recovery using a simple
combinatorial procedure. If this is true in general, it might not be necessary to search for a gap free of
cluster sizes. Perhaps for any $\kappa$, (CP1) identifies all large clusters above $\ell_\sharp$ after a possible simple mid-size
cleanup procedure. Understanding this phenomenon and its algorithmic implications is of much interest.
Figure 2: The solution to (CP1) with mid-size clusters.

5 Discussion 

An immediate direction for future research is to better understand the "mid-size crisis". Our current results say nothing
about clusters that are neither big nor small, falling in the interval $(\ell_\flat, \ell_\sharp)$. Our numerical experiments
confirm that the mid-size phenomenon is real: such clusters are neither completely recovered nor entirely ignored by
the optimal $\hat{K}$. The part of $\hat{K}$ restricted to these clusters does not seem to have an obvious pattern. Proving
whether we can still efficiently recover large clusters in the presence of mid-size clusters is an interesting
open problem.

Our study was mainly theoretical, focusing on the planted partition model. As such, our experiments
focused on confirming the theoretical findings with data generated exactly according to the distribution for
which we could provide provable guarantees. It would be interesting to apply the presented methodology to real
applications, particularly large data sets arising from web applications and social networks.

Another interesting direction is extending the "peeling strategy" to other high-dimensional learning
problems. This requires understanding when such a strategy may work. One intuitive explanation of the
small cluster barrier encountered in previous work is ambiguity: when viewing the input at the "big cluster
resolution", a small cluster is both a low-rank matrix and a sparse matrix. Only when "zooming in" (after
recovering big clusters) do small-cluster patterns emerge. There are other formulations with a similar property.
For example, in Xu et al. 2012, the authors propose to decompose a matrix into the sum of a low-rank one
and a column-sparse one to solve an outlier-resistant PCA task. Notice that a column-sparse matrix is also
a low-rank matrix. We hope the "peeling strategy" may also help with that problem.
A Notation and Conventions



We use the following notation and conventions throughout the supplement. For a real $n \times n$ matrix $M$, we
use the unadorned norm $\|M\|$ to denote its spectral norm. The notation $\|M\|_F$ refers to the Frobenius norm,
$\|M\|_1$ is $\sum_{i,j} |M(i,j)|$ and $\|M\|_\infty$ is $\max_{i,j} |M(i,j)|$.

We will also study operators on the space of matrices. To distinguish them from the matrices studied
in this work, we will simply call these objects "operators", and will denote them using a calligraphic font,
e.g. $\mathcal{P}$. The norm $\|\mathcal{P}\|$ of an operator is defined as

$$\|\mathcal{P}\| = \sup_{M : \|M\|_F = 1} \|\mathcal{P} M\|_F ,$$

where the supremum is over matrices $M$.

For a fixed, real $n \times n$ matrix $M$, we define the matrix linear subspace $T(M)$ as follows:

$$T(M) := \{YM + MX : X, Y \in \mathbb{R}^{n \times n}\} .$$

In words, this subspace is the set of matrices spanned by matrices each row of which is in the row space of
$M$, and matrices each column of which is in the column space of $M$.

For any given subspace of matrices $S \subseteq \mathbb{R}^{n \times n}$, we let $\mathcal{P}_S$ denote the orthogonal projection onto $S$ with
respect to the inner product $\langle X, Y \rangle = \sum_{i,j} X(i,j) Y(i,j) = \operatorname{tr} X^\top Y$. This means that for any matrix $M$,

$$\mathcal{P}_S M = \operatorname{argmin}_{X \in S} \|M - X\|_F .$$

For a matrix $M$, we let $\Gamma(M)$ denote the set of matrices supported on a subset of the support of $M$.
Note that for any matrix $X$,

$$(\mathcal{P}_{\Gamma(X)} M)(i,j) = \begin{cases} M(i,j) & X(i,j) \neq 0 \\ 0 & \text{otherwise.} \end{cases}$$

It is a well known fact that $\mathcal{P}_{T(X)}$ is given as follows:

$$\mathcal{P}_{T(X)} M = P_{C(X)} M + M P_{R(X)} - P_{C(X)} M P_{R(X)} ,$$

where $P_{C(X)}$ is projection (of a vector) onto the column space of $X$, and $P_{R(X)}$ is projection onto the row
space of $X$.

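For a subspace $S \subseteq \mathbb{R}^{n \times n}$ we let $S^\perp$ denote the orthogonal subspace with respect to $\langle \cdot, \cdot \rangle$:

$$S^\perp = \{X \in \mathbb{R}^{n \times n} : \langle X, Y \rangle = 0 \ \ \forall Y \in S\} .$$

Slightly abusing notation, we will use the set complement operator $(\cdot)^c$ to formally define $\Gamma(M)^c$ to be $\Gamma(M)^\perp$
(by this we are stressing that the space $\Gamma(M)^\perp$ is given as $\Gamma(M')$ where $M'$ is any matrix such that $M$ and
$M'$ have complementary supports). Note that $\mathcal{P}_{T(X)^\perp} M = M - \mathcal{P}_{T(X)} M = (I - P_{C(X)}) M (I - P_{R(X)})$.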
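
The two projection formulas for the subspace $T(X)$ and its orthogonal complement can be checked numerically. The sketch below is an illustration only: the function names are ours, and the column and row projectors are built from a singular value decomposition with a relative rank tolerance, which is one reasonable choice among several.

```python
import numpy as np


def column_row_projectors(X, rel_tol=1e-10):
    """Orthogonal projectors onto the column space and the row space of X."""
    U, s, Vt = np.linalg.svd(X)
    rank = int(np.sum(s > rel_tol * s[0])) if s[0] > 0 else 0
    U_r = U[:, :rank]
    V_r = Vt[:rank, :].T
    return U_r @ U_r.T, V_r @ V_r.T


def project_T(X, M):
    """P_T(X) M = P_C M + M P_R - P_C M P_R."""
    P_c, P_r = column_row_projectors(X)
    return P_c @ M + M @ P_r - P_c @ M @ P_r


def project_T_perp(X, M):
    """Projection onto the orthogonal complement: (I - P_C) M (I - P_R)."""
    P_c, P_r = column_row_projectors(X)
    I = np.eye(X.shape[0])
    return (I - P_c) @ M @ (I - P_r)
```

For any square $M$, the two outputs sum to $M$ and are orthogonal under the trace inner product, and matrices of the form $YX + XZ$ are left unchanged by `project_T`.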
For a matrix $M$, $\operatorname{sgn} M$ is defined as the matrix satisfying:

$$(\operatorname{sgn} M)(i,j) = \begin{cases} 1 & M(i,j) > 0 \\ -1 & M(i,j) < 0 \\ 0 & \text{otherwise.} \end{cases}$$



B Proof of Theorem 1



The proof is based on Chen et al. 2012. We prove it for $\kappa = 1$. The adjustment for $\kappa > 1$ is done using a
padding argument, presented at the end of the proof.
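Additional notation:

1. We let $V_\flat \subseteq V$ denote the set of elements $i$ such that $n_{(i)} \le \ell_\flat$. (We remind the reader that
$n_{(i)} = |V_{(i)}|$.)

2. We remind the reader that the projection $\mathcal{P}_\sharp$ is defined as follows:

$$(\mathcal{P}_\sharp M)(i,j) = \begin{cases} M(i,j) & \max\{n_{(i)}, n_{(j)}\} \ge \ell_\sharp \\ 0 & \text{otherwise.} \end{cases}$$

3. The projection $\mathcal{P}_\flat$ is defined as follows:

$$(\mathcal{P}_\flat M)(i,j) = \begin{cases} M(i,j) & \max\{n_{(i)}, n_{(j)}\} \le \ell_\flat \\ 0 & \text{otherwise.} \end{cases}$$

In words, $\mathcal{P}_\flat$ projects onto the set of matrices supported on $V_\flat \times V_\flat$. Note that by the theorem
assumption, $\mathcal{P}_\sharp + \mathcal{P}_\flat = \mathrm{Id}$ (equivalently, $\mathcal{P}_\sharp$ projects onto the set of matrices supported on $(V \times V) \setminus (V_\flat \times V_\flat)$).

4. Define the set

$$\mathfrak{D} = \{\Delta \in \mathbb{R}^{n \times n} \;|\; \Delta_{ij} \le 0, \ \forall i \sim j, (i,j) \notin V_\flat \times V_\flat; \ \ 0 \le \Delta_{ij}, \ \forall i \not\sim j, (i,j) \notin V_\flat \times V_\flat\} ,$$

which contains all feasible deviations from $\hat{K}$.

5. For simplicity we write $T := T(\hat{K})$ and $\Gamma := \Gamma(\hat{B})$, $\Gamma^c := \Gamma(\hat{B})^c = \Gamma^\perp$.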

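We will make use of the following:

1. $\operatorname{sgn}(\hat{B}) = \hat{B}$.

2. $\mathrm{Id} = \mathcal{P}_\Gamma + \mathcal{P}_{\Gamma^c} = \mathcal{P}_{\Gamma(A)} + \mathcal{P}_{\Gamma(A)^c}$.

3. $\mathcal{P}_\sharp, \mathcal{P}_\flat, \mathcal{P}_\Gamma, \mathcal{P}_{\Gamma^c}, \mathcal{P}_{\Gamma(A)}$, and $\mathcal{P}_{\Gamma(A)^c}$ commute with each other.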
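B.1 Approximate Dual Certificate Condition

Proposition 12. $(\hat{K}, \hat{B})$ is the unique optimal solution to (CP1) if there exists a matrix $Q \in \mathbb{R}^{n \times n}$ and a
positive number $\epsilon$ satisfying:

1. $\|Q\| < 1$

2. $\|\mathcal{P}_T(Q)\|_\infty \le \frac{\epsilon}{2} \min\{c_1, c_2\}$

3. $\forall \Delta \in \mathfrak{D}$:

(a) $\langle UU^\top + Q, \mathcal{P}_{\Gamma(A)} \mathcal{P}_\Gamma \mathcal{P}_\sharp \Delta \rangle = (1 + \epsilon) c_1 \|\mathcal{P}_{\Gamma(A)} \mathcal{P}_\Gamma \mathcal{P}_\sharp \Delta\|_1$

(b) $\langle UU^\top + Q, \mathcal{P}_{\Gamma(A)^c} \mathcal{P}_\Gamma \mathcal{P}_\sharp \Delta \rangle = (1 + \epsilon) c_2 \|\mathcal{P}_{\Gamma(A)^c} \mathcal{P}_\Gamma \mathcal{P}_\sharp \Delta\|_1$

4. $\forall \Delta \in \mathfrak{D}$:

(a) $\langle UU^\top + Q, \mathcal{P}_{\Gamma(A)} \mathcal{P}_{\Gamma^c} \mathcal{P}_\sharp \Delta \rangle \ge -(1 - \epsilon) c_1 \|\mathcal{P}_{\Gamma(A)} \mathcal{P}_{\Gamma^c} \mathcal{P}_\sharp \Delta\|_1$

(b) $\langle UU^\top + Q, \mathcal{P}_{\Gamma(A)^c} \mathcal{P}_{\Gamma^c} \mathcal{P}_\sharp \Delta \rangle \ge -(1 - \epsilon) c_2 \|\mathcal{P}_{\Gamma(A)^c} \mathcal{P}_{\Gamma^c} \mathcal{P}_\sharp \Delta\|_1$

5. $\mathcal{P}_\Gamma \mathcal{P}_\flat (UU^\top + Q) = c_1 \mathcal{P}_\flat \hat{B}$

6. $\|\mathcal{P}_{\Gamma^c} \mathcal{P}_\flat (UU^\top + Q)\|_\infty \le c_2$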

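Proof. Consider any feasible solution to (CP1) $(\hat{K} + \Delta, \hat{B} - \Delta)$; we know $\Delta \in \mathfrak{D}$ due to the inequality
constraints in (CP1). We will show that this solution will have strictly higher objective value than $(\hat{K}, \hat{B})$
if $\Delta \neq 0$.

For this $\Delta$, let $G_\Delta$ be a matrix in $T^\perp \cap \operatorname{Range}(\mathcal{P}_\flat)$ satisfying $\|G_\Delta\| = 1$ and $\langle G_\Delta, \Delta \rangle = \|\mathcal{P}_{T^\perp} \mathcal{P}_\flat \Delta\|_*$;
such a matrix always exists because $\operatorname{Range} \mathcal{P}_\flat \subseteq T^\perp$. Suppose $\|Q\| = b$. Clearly, $\mathcal{P}_{T^\perp} Q + (1 - b) G_\Delta \in T^\perp$ and,
due to desideratum 1, we have $\|\mathcal{P}_{T^\perp} Q + (1 - b) G_\Delta\| \le \|Q\| + (1 - b) \|G_\Delta\| = b + (1 - b) = 1$. Therefore,
$UU^\top + \mathcal{P}_{T^\perp} Q + (1 - b) G_\Delta$ is a subgradient of $f(K) = \|K\|_*$ at $K = \hat{K}$. On the other hand, define the
matrix $F_\Delta = -\mathcal{P}_{\Gamma^c} \operatorname{sgn}(\Delta)$. We have $F_\Delta \in \Gamma^c$ and $\|F_\Delta\|_\infty \le 1$. Therefore, $\mathcal{P}_{\Gamma(A)}(\hat{B} + F_\Delta)$ is a subgradient
of $g_1(B) = \|\mathcal{P}_{\Gamma(A)} B\|_1$ at $B = \hat{B}$, and $\mathcal{P}_{\Gamma(A)^c}(\hat{B} + F_\Delta)$ is a subgradient of $g_2(B) = \|\mathcal{P}_{\Gamma(A)^c} B\|_1$ at $B = \hat{B}$.
Using these three subgradients, the difference in the objective value can be bounded as follows: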



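$$
\begin{aligned}
d(\Delta) =\;& \|\hat{K} + \Delta\|_* + c_1 \|\mathcal{P}_{\Gamma(A)}(\hat{B} - \Delta)\|_1 + c_2 \|\mathcal{P}_{\Gamma(A)^c}(\hat{B} - \Delta)\|_1 - \|\hat{K}\|_* - c_1 \|\mathcal{P}_{\Gamma(A)} \hat{B}\|_1 - c_2 \|\mathcal{P}_{\Gamma(A)^c} \hat{B}\|_1 \\
\ge\;& \langle UU^\top + \mathcal{P}_{T^\perp} Q + (1 - b) G_\Delta, \Delta \rangle + c_1 \langle \mathcal{P}_{\Gamma(A)}(\hat{B} + F_\Delta), -\Delta \rangle + c_2 \langle \mathcal{P}_{\Gamma(A)^c}(\hat{B} + F_\Delta), -\Delta \rangle \\
=\;& (1 - b) \|\mathcal{P}_{T^\perp} \mathcal{P}_\flat \Delta\|_* + \langle UU^\top + \mathcal{P}_{T^\perp} Q, \Delta \rangle + c_1 \langle \mathcal{P}_{\Gamma(A)} \hat{B}, -\Delta \rangle + c_2 \langle \mathcal{P}_{\Gamma(A)^c} \hat{B}, -\Delta \rangle \\
& + c_1 \langle \mathcal{P}_{\Gamma(A)} F_\Delta, -\Delta \rangle + c_2 \langle \mathcal{P}_{\Gamma(A)^c} F_\Delta, -\Delta \rangle \\
=\;& (1 - b) \|\mathcal{P}_{T^\perp} \mathcal{P}_\flat \Delta\|_* + \langle UU^\top + \mathcal{P}_{T^\perp} Q, \Delta \rangle + c_1 \langle \mathcal{P}_\flat \mathcal{P}_{\Gamma(A)} \hat{B}, -\Delta \rangle + c_2 \langle \mathcal{P}_\flat \mathcal{P}_{\Gamma(A)^c} \hat{B}, -\Delta \rangle \\
& + c_1 \langle \mathcal{P}_\sharp \mathcal{P}_{\Gamma(A)} \hat{B}, -\Delta \rangle + c_2 \langle \mathcal{P}_\sharp \mathcal{P}_{\Gamma(A)^c} \hat{B}, -\Delta \rangle + c_1 \langle \mathcal{P}_{\Gamma(A)} F_\Delta, -\Delta \rangle + c_2 \langle \mathcal{P}_{\Gamma(A)^c} F_\Delta, -\Delta \rangle .
\end{aligned}
$$

The last six terms of the last RHS satisfy:

1. $c_1 \langle \mathcal{P}_\flat \mathcal{P}_{\Gamma(A)} \hat{B}, -\Delta \rangle + c_2 \langle \mathcal{P}_\flat \mathcal{P}_{\Gamma(A)^c} \hat{B}, -\Delta \rangle = c_1 \langle \mathcal{P}_\flat \hat{B}, -\Delta \rangle$, because $\mathcal{P}_\flat \hat{B} \in \Gamma(A)$.

2. $\langle \mathcal{P}_\sharp \mathcal{P}_{\Gamma(A)} \hat{B}, -\Delta \rangle \ge -\|\mathcal{P}_\sharp \mathcal{P}_{\Gamma(A)} \mathcal{P}_\Gamma \Delta\|_1$ and $\langle \mathcal{P}_\sharp \mathcal{P}_{\Gamma(A)^c} \hat{B}, -\Delta \rangle \ge -\|\mathcal{P}_\sharp \mathcal{P}_{\Gamma(A)^c} \mathcal{P}_\Gamma \Delta\|_1$, because $\hat{B} \in \Gamma$
and $\|\hat{B}\|_\infty \le 1$.

3. $\langle \mathcal{P}_{\Gamma(A)} F_\Delta, -\Delta \rangle = \|\mathcal{P}_{\Gamma(A)} \mathcal{P}_{\Gamma^c} \Delta\|_1$ and $\langle \mathcal{P}_{\Gamma(A)^c} F_\Delta, -\Delta \rangle = \|\mathcal{P}_{\Gamma(A)^c} \mathcal{P}_{\Gamma^c} \Delta\|_1$, due to the definition of $F_\Delta$.

It follows that

$$
\begin{aligned}
d(\Delta) \ge\;& (1 - b) \|\mathcal{P}_{T^\perp} \mathcal{P}_\flat \Delta\|_* + \langle UU^\top + \mathcal{P}_{T^\perp} Q, \Delta \rangle + c_1 \langle \mathcal{P}_\flat \hat{B}, -\Delta \rangle - c_1 \|\mathcal{P}_\sharp \mathcal{P}_{\Gamma(A)} \mathcal{P}_\Gamma \Delta\|_1 \\
& - c_2 \|\mathcal{P}_\sharp \mathcal{P}_{\Gamma(A)^c} \mathcal{P}_\Gamma \Delta\|_1 + c_1 \|\mathcal{P}_{\Gamma(A)} \mathcal{P}_{\Gamma^c} \Delta\|_1 + c_2 \|\mathcal{P}_{\Gamma(A)^c} \mathcal{P}_{\Gamma^c} \Delta\|_1 . \qquad (6)
\end{aligned}
$$

Consider the second term in the last RHS, which equals $\langle UU^\top + \mathcal{P}_{T^\perp} Q, \Delta \rangle = \langle UU^\top + Q, \mathcal{P}_\sharp \Delta \rangle + \langle UU^\top + Q, \mathcal{P}_\flat \Delta \rangle - \langle \mathcal{P}_T Q, \Delta \rangle$. We bound these three separately.

First term:

$$
\begin{aligned}
\langle UU^\top + Q, \mathcal{P}_\sharp \Delta \rangle
=\;& \langle UU^\top + Q, (\mathcal{P}_{\Gamma(A)} \mathcal{P}_\Gamma \mathcal{P}_\sharp + \mathcal{P}_{\Gamma(A)^c} \mathcal{P}_\Gamma \mathcal{P}_\sharp + \mathcal{P}_{\Gamma(A)} \mathcal{P}_{\Gamma^c} \mathcal{P}_\sharp + \mathcal{P}_{\Gamma(A)^c} \mathcal{P}_{\Gamma^c} \mathcal{P}_\sharp) \Delta \rangle \\
\ge\;& (1 + \epsilon) c_1 \|\mathcal{P}_{\Gamma(A)} \mathcal{P}_\Gamma \mathcal{P}_\sharp \Delta\|_1 + (1 + \epsilon) c_2 \|\mathcal{P}_{\Gamma(A)^c} \mathcal{P}_\Gamma \mathcal{P}_\sharp \Delta\|_1 - (1 - \epsilon) c_1 \|\mathcal{P}_{\Gamma(A)} \mathcal{P}_{\Gamma^c} \mathcal{P}_\sharp \Delta\|_1 \\
& - (1 - \epsilon) c_2 \|\mathcal{P}_{\Gamma(A)^c} \mathcal{P}_{\Gamma^c} \mathcal{P}_\sharp \Delta\|_1 \qquad \text{(using properties 3 and 4)}
\end{aligned}
$$

Second term:

$$
\begin{aligned}
\langle UU^\top + Q, \mathcal{P}_\flat \Delta \rangle
=\;& \langle \mathcal{P}_\Gamma \mathcal{P}_\flat (UU^\top + Q), \Delta \rangle + \langle \mathcal{P}_{\Gamma^c} \mathcal{P}_\flat (UU^\top + Q), \Delta \rangle \\
\ge\;& c_1 \langle \mathcal{P}_\flat \hat{B}, \Delta \rangle - c_2 \|\mathcal{P}_{\Gamma^c} \mathcal{P}_\flat \Delta\|_1 \qquad \text{(using properties 5 and 6)} \\
=\;& c_1 \langle \mathcal{P}_\flat \hat{B}, \Delta \rangle - c_2 \|\mathcal{P}_{\Gamma(A)^c} \mathcal{P}_{\Gamma^c} \mathcal{P}_\flat \Delta\|_1 \qquad \text{(because } \mathcal{P}_{\Gamma(A)^c} \mathcal{P}_{\Gamma^c} \mathcal{P}_\flat = \mathcal{P}_{\Gamma^c} \mathcal{P}_\flat \text{)}
\end{aligned}
$$

Third term: Due to the block diagonal structure of the elements of $T$, we have $\mathcal{P}_T = \mathcal{P}_\sharp \mathcal{P}_T$, so

$$\langle -\mathcal{P}_T Q, \Delta \rangle \ge -\|\mathcal{P}_T Q\|_\infty \|\mathcal{P}_\sharp \Delta\|_1 \ge -\frac{\epsilon}{2} \min\{c_1, c_2\} \|\mathcal{P}_\sharp \Delta\|_1 .$$

Combining the above three bounds with Eq. (6), we obtain

$$
\begin{aligned}
d(\Delta) \ge\;& (1 - b) \|\mathcal{P}_{T^\perp} \mathcal{P}_\flat \Delta\|_* + \epsilon c_1 \|\mathcal{P}_\sharp \mathcal{P}_{\Gamma(A)} \mathcal{P}_\Gamma \Delta\|_1 + \epsilon c_2 \|\mathcal{P}_\sharp \mathcal{P}_{\Gamma(A)^c} \mathcal{P}_\Gamma \Delta\|_1 + \epsilon c_1 \|\mathcal{P}_{\Gamma(A)} \mathcal{P}_{\Gamma^c} \mathcal{P}_\sharp \Delta\|_1 \\
& + \epsilon c_2 \|\mathcal{P}_{\Gamma(A)^c} \mathcal{P}_{\Gamma^c} \mathcal{P}_\sharp \Delta\|_1 + c_1 \|\mathcal{P}_{\Gamma(A)} \mathcal{P}_{\Gamma^c} \mathcal{P}_\flat \Delta\|_1 - \frac{\epsilon}{2} \min\{c_1, c_2\} \|\mathcal{P}_\sharp \Delta\|_1 \\
=\;& (1 - b) \|\mathcal{P}_{T^\perp} \mathcal{P}_\flat \Delta\|_* + \epsilon c_1 \|\mathcal{P}_\sharp \mathcal{P}_{\Gamma(A)} \Delta\|_1 + \epsilon c_2 \|\mathcal{P}_\sharp \mathcal{P}_{\Gamma(A)^c} \Delta\|_1 - \frac{\epsilon}{2} \min\{c_1, c_2\} \|\mathcal{P}_\sharp \Delta\|_1 \\
& \text{(note that } \mathcal{P}_{\Gamma(A)} \mathcal{P}_{\Gamma^c} \mathcal{P}_\flat \Delta = 0 \text{)} \\
\ge\;& (1 - b) \|\mathcal{P}_\flat \Delta\|_* + \frac{\epsilon}{2} \min\{c_1, c_2\} \|\mathcal{P}_\sharp \Delta\|_1 ,
\end{aligned}
$$

which is strictly greater than zero for $\Delta \neq 0$. □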
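
The proof above relies on the standard description of subgradients of the trace norm at a positive semidefinite matrix: $UU^\top + W$ with $W$ orthogonal to the column space on both sides and $\|W\| \le 1$. The following sketch checks the resulting subgradient inequality numerically on a tiny block-diagonal cluster matrix. It is an illustration only; the helper names are ours and the instance is arbitrary.

```python
import numpy as np


def nuclear_norm(M):
    """Sum of singular values (the trace norm)."""
    return np.linalg.svd(M, compute_uv=False).sum()


def cluster_factor(n, clusters):
    """Columns are normalised cluster indicator vectors, so U has orthonormal
    columns spanning the column space of the cluster matrix."""
    U = np.zeros((n, len(clusters)))
    for c, members in enumerate(clusters):
        U[members, c] = 1.0 / np.sqrt(len(members))
    return U


def subgradient(U, W_raw):
    """Build U U^T + W, where W lies in the orthogonal complement of T and is
    rescaled so that its spectral norm is at most 1."""
    n = U.shape[0]
    P_perp = np.eye(n) - U @ U.T
    W = P_perp @ W_raw @ P_perp
    W = W / max(1.0, np.linalg.norm(W, 2))
    return U @ U.T + W
```

For a cluster matrix $K = U \operatorname{diag}(\text{sizes}) U^\top$ and any perturbation $\Delta$, one expects $\|K + \Delta\|_* \ge \|K\|_* + \langle G, \Delta \rangle$ for every $G$ produced by `subgradient`.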
B.2 Constructing Q



We construct a matrix Q with the properties required by Proposition 
and use the weights Ci and C2 given in Theorem [T] We specify V^Q 
Suppose we take e = " , A^j^^-^



and V\fQ separately. 
■PoQ is given by V^Q = V^Qi + -PnQs + ^jQa, where for (i, j) ^ H x H, 



1 



i 7^ j 

-(i + e)c2 i-'j, («,j)er 

ZT^j 

■(i + e)ci «7^j,(i,j)er 

.0 ' 



Note that these matrices have zero-mean entries. 
HQ as follows. For (i, j) e Vj, x H, 



no 



'ci i~j,(i,j)er(A) 
ci i/j, (i,i)er(A) 



where 



-1 with probability 



1 — 1 with remaining probability 



B.3 Validating Q 

Under the choice of t in Theorem n] we have \p <t <p and \{^ — q) < 1 — 1< 1 — <?. Also under the second 

n lo ^ n lo ^ n 

assumption of the theorem and p — q < p(l — q), we have p{l — q) }^ " °f " > °f " . We will make use of 
these facts frequently in the proof. 

It is easy to check that e := < h under the assumption of Theorem 

Property 1): 

Note that \\Q\\ < \\Vf^Q^\\ + WV^Q^W + IIHO-II + IIHQt^II- We show that aU four terms are upper- 
bounded by |. 

(a) V\,Qr^ is a block diagonal matrix with each block having size at most Moreover, V\,Qr^ is the 
sum of a deterministic matrix ^ with all non-zero entries equal to , F^^ and a random matrix 

^ ^ \/«logn ^t(l-t) 

O^^r whose entries are i.i.d., bounded almost surely by max{ci,C2} and have zero mean with variance 
fji^. Therefore, we have ||Q^|1 < 



1, where w.h.p. 



bi 



p-t 



^/n\ogn y/t{l - t) ' 



linQ^ 



< 6 max < £b log i 



bi 



\Jn log 



t{\-t) 



max{ci,C2}log^n 



here in the second inequality we use Lemma 16 We conclude that ||7^bQ~ll is bounded by ^ as long as 
4 < ^*gj)^(p"f°^" and max{ci, C2} < jg^"^: which holds under the assumption of Theorem jlj 

(b) 'P\,Q^ is a random matrix supported on Vj, x Vj,: whose entries are i.i.d., zero mean, bounded almost 
surely by maxjci, C2}, and have variance ^ ^ • ^-^^1^^. It follows from Lemma 16 that 

61 



llHQ/ll < 6 max ■ 



because n\, <n and max{ci,C2} < 



t^ + q-2tq 

VnlS^V il-t)t 



, max{ci,C2}log n> < 



1 



48 loe^ n ' 



which holds under the assumption of the theorem. 



(c) Note that V^^Qr^ = 'PjQi + 'PjQ2- By construction these two matrices are both block-diagonal, 
have i.i.d zero-mean entries which are bounded almost surely by := max||^, ^| and have variance 

bounded by cr^ :— max ^^^y^Cjj. Lemma 16 gives ||'PdQ~|| < 6max {y^ • a^,B^ log^ n] < j under 

the assumption of Theorem [T] 

(d) Note that V^Q^ = 'PdQs is a random matrix with i.i.d. zero-mean entries wh ich are bounded 



:— j^c^. Lemma 
gives WV^Q^l 



< 



almost surely by := and have variance bounded by cr^ 
6 max { ^/n ■ , B^ log^ n\ <\- 

Property 2): 

Due to the structure of T, we have 

II^tQIL = IIWbQIL = \uu^{v^Q) + {ViQ)uu^ + uu^iViQ)uu^\\^ 

3 

< 3\\UU^rtQ\\^<3Y.P^^'PiQ"^\\oc- 

Now observe that {UU^V^Qm){i^j) — Tli^v i ) ^) i^ '^^ i.i.d. zero-mean random vari- 

ables with bounded magnitude and variance. Using Lemma 18, we obtain that for i € Vj, 



\iUU-V,Q.)i^,j)\ < 4(^V«c,)logn- 



log n 



< 



1 /logn logn ft 



< 



where in the last inequality we use t > j> For i e H, clearly {UU'^'P(,Qi){i, j) — 0. By imion bound 



we conclude that \\UU^ V^QiW^ < ^^Jj- Similarly, we can bound \\UU^ ■PiQ2\\^ and WUVV^Qs 
with the same quantity (cf. Chen et al. 2012| ). 
On the other hand, under the definition of $c_1$, $c_2$ and $\epsilon$, we have

/ 1 ~ t 2 log^ n I n -y/plogn log n ft log n ft 

Cie = Oix — • J -J- ^ = 61 -WTTTX -> 



tnlogn V ^(1 tVi 24€tt V P ~ 24£p V P 

and similarly C2e > ^^y^- It follows that H'PtOHoo — ^ ' ^fiiiiii {ci, C2}, proving property 2). 

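Properties 3a) and 3b):

For 3a), by construction of $Q$ we have

$$
\begin{aligned}
\langle UU^\top + Q, \mathcal{P}_{\Gamma(A)} \mathcal{P}_\Gamma \mathcal{P}_\sharp \Delta \rangle
=\;& \langle \mathcal{P}_{\Gamma(A)} \mathcal{P}_\Gamma \mathcal{P}_\sharp Q_3, \mathcal{P}_{\Gamma(A)} \mathcal{P}_\Gamma \mathcal{P}_\sharp \Delta \rangle \\
=\;& (1 + \epsilon) c_1 \sum_{(i,j) \in \Gamma \cap \Gamma(A)} (\mathcal{P}_\sharp \Delta)(i,j) \\
=\;& (1 + \epsilon) c_1 \|\mathcal{P}_{\Gamma(A)} \mathcal{P}_\Gamma \mathcal{P}_\sharp \Delta\|_1 \qquad \text{(because } \Delta \in \mathfrak{D} \text{)}
\end{aligned}
$$

Property 3b) can be verified similarly.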
Properties 4a) and 4b): 

For 4a), we have 

(UU^ + g,Pr(A)PrcPsA) = {Vr(A)VToV^ {UU^ + V^Qi + PjQs) ,Pr(A)^r^nA) 

(here we use A e 3), ftj > p, and 71^(1) > ^jfor i e Vj). 
Consider the two terms in the parenthesis in the last RHS. For the first term, we have 



1 21og^n I n / t{l-t) 21og^n I n , / 1-t 

~ < — 7, \ hrr. 7^-bi\—, = eci. 



pi^ Y i(l - 1) Y Vnlog^n - Y *(1 - V inlogn 

For the second term, we have the following 

_ 63 log^n-v/p(l - q)n 

^ - 4 - 4 

&3 V^(l - g) 2 log' 

which implies (1 + e)c2^^ ^ (1 ~ 2e)ci. We conclude that 

{UU'^ ^Q,Vv^A)PvcV^A) > - (eci + (1 - 2e)ci) ||Pr(A)-PrcnA||i , 
proving property 4a). 

For 4b), we have 

(UU^ +Q,VriA)cVroViA) = {Vr(A)cVTcViQ3,Vr(A)cVroViA) 

(i,j)er(A)<=nr<=nRange-Pj ^'-^ 
> -(1 + \\'Pr{A)c'PrcViA\\^ . (here we use qij < q) 
Consider the factor before the norm in the last RHS. Similarly as before, we have 

p-q bs log^ n^p{l - q)n 
t — q> — - — > -; 



> 2-t{l-q) 



2 log n\/n 



2t{l-q)e. 



- t) 

This implies (1 + e)ci < (1 — e)c2- We conclude that 

{UU^ + Q,VT(AYVT^Vil:^) > -{1 ~ e)c2\\rriAyVr^ViA\\^ , 

proving property 4b). 

Properties 5) and 6): It is obvious that these two properties hold by construction of Q. 
Note that properties 3)-6) hold deterministically. 



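B.4 The $\kappa > 1$ case

Let $n' = \kappa n$ and assume $n'$ is an integer. Let $A' \in \mathbb{R}^{n' \times n'}$ be such a matrix that

$$A' = \begin{bmatrix} A & 0 \\ 0 & I \end{bmatrix} .$$

Consider the following padded program

$$
\begin{aligned}
\text{(CP1')} \qquad \min \;& \|K'\|_* + c_1 \|\mathcal{P}_{\Gamma(A')} B'\|_1 + c_2 \|\mathcal{P}_{\Gamma(A')^c} B'\|_1 \\
\text{s.t.} \;& K' + B' = A' \\
& 0 \le K'_{ij} \le 1, \ \forall (i,j) .
\end{aligned}
$$

Applying Theorem 1 with $\kappa = 1$ (which we have proved) to $A'$ and the padded program (CP1'), we conclude
that the unique optimal solution $(\hat{K}', \hat{B}' = A' - \hat{K}')$ to (CP1') has the form

$$\hat{K}' = \begin{bmatrix} \mathcal{P}_\sharp K^* & 0 \\ 0 & 0 \end{bmatrix} .$$

We claim that $\hat{K} = \mathcal{P}_\sharp K^*$ is the unique optimal solution to (CP1).

Proof by contradiction: suppose an optimal solution to (CP1) is $K = K_0 \neq \mathcal{P}_\sharp K^*$. By optimality we
have

$$\|K_0\|_* + c_1 \|\mathcal{P}_{\Gamma(A)}(A - K_0)\|_1 + c_2 \|\mathcal{P}_{\Gamma(A)^c}(A - K_0)\|_1 \le \|\mathcal{P}_\sharp K^*\|_* + c_1 \|\mathcal{P}_{\Gamma(A)}(A - \mathcal{P}_\sharp K^*)\|_1 + c_2 \|\mathcal{P}_{\Gamma(A)^c}(A - \mathcal{P}_\sharp K^*)\|_1 .$$

Define

$$K_0' = \begin{bmatrix} K_0 & 0 \\ 0 & 0 \end{bmatrix} .$$

It follows that

$$
\begin{aligned}
& \|K_0'\|_* + c_1 \|\mathcal{P}_{\Gamma(A')}(A' - K_0')\|_1 + c_2 \|\mathcal{P}_{\Gamma(A')^c}(A' - K_0')\|_1 \\
=\;& \|K_0\|_* + c_1 \|\mathcal{P}_{\Gamma(A)}(A - K_0)\|_1 + c_1 (n' - n) + c_2 \|\mathcal{P}_{\Gamma(A)^c}(A - K_0)\|_1 \\
\le\;& \|\mathcal{P}_\sharp K^*\|_* + c_1 \|\mathcal{P}_{\Gamma(A)}(A - \mathcal{P}_\sharp K^*)\|_1 + c_1 (n' - n) + c_2 \|\mathcal{P}_{\Gamma(A)^c}(A - \mathcal{P}_\sharp K^*)\|_1 \\
=\;& \|\hat{K}'\|_* + c_1 \|\mathcal{P}_{\Gamma(A')}(A' - \hat{K}')\|_1 + c_2 \|\mathcal{P}_{\Gamma(A')^c}(A' - \hat{K}')\|_1 ,
\end{aligned}
$$

contradicting the fact that $(\hat{K}', \hat{B}' = A' - \hat{K}')$ is the unique optimal solution to (CP1').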
C Proof of Theorem 3



Fix $\kappa \ge 1$ and $t$ in the allowed range, let $(K, B)$ be an optimal solution to (CP1), and assume $K$ is a partial
clustering induced by $U_1, \dots, U_r$ for some integer $r$, and also assume $\sigma_{\min}(K) = \min_{i \in [r]} |U_i|$ satisfies (jsj).
Let $M = \sigma_{\min}(K)$.

We need a few helpful facts. First, note that any value of $t$ in the allowed range $[\frac{1}{4} p + \frac{3}{4} q, \frac{3}{4} p + \frac{1}{4} q]$
satisfies $q + \frac{1}{4}(p - q) \le t \le p - \frac{1}{4}(p - q)$. Also note that from the definition of $t, c_1, c_2$,

$$q + \frac{1}{4}(p - q) \le \frac{c_2}{c_1 + c_2} = t \le p - \frac{1}{4}(p - q) . \qquad (7)$$

We say that a pair of sets $Y \subseteq V$, $Z \subseteq V$ is cluster separated if there is no pair $(y, z) \in Y \times Z$ satisfying
$y \sim z$.

Assumption 13. There exists a constant $C' > 0$ such that for all pairs of cluster-separated sets $Y, Z$ of size
at least $m := C' \log n / (p - q)^2$ each,

$$|d_{Y,Z} - q| < \frac{1}{4}(p - q) , \qquad (8)$$

where $d_{Y,Z} := \frac{|(Y \times Z) \cap \Omega|}{|Y| \cdot |Z|}$.

This is proven by a Hoeffding tail bound and a union bound to hold with probability at least 1 — n^^. 
To see why, fix the sizes $m_Y, m_Z$ of $|Y|, |Z|$, and assume $m_Y \le m_Z$ w.l.o.g. For each such choice, there are at
most $\exp\{C(m_Y + m_Z) \log n\} \le \exp\{2 C m_Z \log n\}$ possibilities for the choice of sets $Y, Z$, for some $C > 0$.
For each such choice, the probability that (8) does not hold is at most

$$\exp\{-C'' m_Y m_Z (p - q)^2\} \qquad (9)$$

using Hoeffding's inequality, for some $C'' > 0$. Hence, as long as $m_Y \ge m$ as defined above, for properly
chosen $C'$, using a union bound (over all possibilities of $m_Y, m_Z$ and of $Y, Z$) we obtain (8) uniformly.
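
The role of the minimum set size in this argument can be seen with a small numerical sketch. It compares the two-sided Hoeffding bound for the cross-density of a cluster-separated pair with a Monte Carlo estimate. The deviation fraction 1/4 and all parameter values are assumptions chosen for illustration, and the function names are ours.

```python
import math
import numpy as np


def hoeffding_failure_bound(m_y, m_z, p, q, frac=0.25):
    """Two-sided Hoeffding bound on Pr[|d_{Y,Z} - q| >= frac * (p - q)],
    where d_{Y,Z} averages m_y * m_z independent Bernoulli(q) edges."""
    t = frac * (p - q)
    return 2.0 * math.exp(-2.0 * m_y * m_z * t * t)


def empirical_failure_rate(m_y, m_z, p, q, frac=0.25, trials=2000, seed=0):
    """Monte Carlo estimate of the same failure probability."""
    rng = np.random.default_rng(seed)
    pairs = m_y * m_z
    density = rng.binomial(pairs, q, size=trials) / pairs
    return float(np.mean(np.abs(density - q) >= frac * (p - q)))
```

For very small sets the bound exceeds 1 and says nothing, while for moderately sized sets it decays fast enough to survive a union bound over all choices of $Y$ and $Z$.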

If we assume also, say, that

$$M \ge 3m , \qquad (10)$$

(which can be done by setting $C_1 \ge 3C'$) the implication of the assumption is that it cannot be the case that
some $U_i$ contains a subset $U_i'$ of size in the range $[m, |U_i| - m]$ such that $U_i' = U_i \cap V_g$ for some $g$. Indeed, if
such a set existed, then we would find a strictly better solution to (CP1), call it $(K', B')$, which is defined so
that $K'$ is obtained from $K$ by splitting the block corresponding to $U_i$ into two blocks, one corresponding to
$U_i'$ and the other to $U_i \setminus U_i'$. The difference $\Delta$ between the cost of $(K', B')$ and that of $(K, B)$ is (renaming $Y := U_i'$
and $Z := U_i \setminus U_i'$) $\Delta = c_1 |(Y \times Z) \cap \Omega| - c_2 |(Y \times Z) \cap \Omega^c| = (c_1 + c_2) d_{Y,Z} |Y| |Z| - c_2 |Y| |Z|$. But the sign of
$\Delta$ is exactly the sign of $d_{Y,Z} - \frac{c_2}{c_1 + c_2}$, which is strictly negative by (7) and (8). (We also used the fact that
the trace norm part of the utility function is equal for both solutions: $\|K'\|_* = \|K\|_*$.)

The conclusion is that for each $i$, the sets $(U_i \cap V_1), \dots, (U_i \cap V_k)$ must all be of size at most $m$, except
maybe for at most one set of size at least $|U_i| - m$. If we now also assume that

$$M > km = (k C' \log n) / (p - q)^2 , \qquad (11)$$

then we conclude that not all these sets can be of size at most $m$. Hence exactly one of these sets must have
size at least $|U_i| - m$. From this we conclude that there is a function $\phi : [r] \to [k]$ such that for all $i \in [r]$,

$$|U_i \cap V_{\phi(i)}| \ge |U_i| - m .$$

We now claim that this function is an injection. We will need the following assumption:
We now claim that this function is an injection. We will need the following assumption: 
Assumption 14. For any 4 pairwise disjoint subsets $(Y, Y', Z, Z')$ such that $(Y \cup Y') \subseteq V_i$ for some $i$,
$(Z \cup Z') \subseteq [n] \setminus V_i$, $\max\{|Z|, |Z'|\} \le m$, $\min\{|Y|, |Y'|\} \ge M - m$:

$$|Y| \cdot |Y'| \, d_{Y,Y'} - |Y| \cdot |Z| \, d_{Y,Z} - |Y'| \cdot |Z'| \, d_{Y',Z'} > \frac{c_2}{c_1 + c_2} \left( |Y| \cdot |Y'| - |Y| \cdot |Z| - |Y'| \cdot |Z'| \right) \qquad (12)$$

The assumption holds with probabihty at least 1 — by using Hoeffding inequality, union bounding 
over all possible sets Y,Y' , Z, Z' as above. Indeed, notice that for fixed mY,mY',mz,mz' (with, say, 
my > rny), and for each tuple Y,Y',Z,Z' such that |y| = r7iy,|y| = my',|Z| = mz,\Z'\ = niz', the 



probability that (12) is violated is at most

$$\exp\{-C (p - q)^2 (m_Y m_{Y'} + m_Y m_Z + m_{Y'} m_{Z'})\} \qquad (13)$$

for some $C > 0$. Using (10), this is at most

$$\exp\{-C'' (p - q)^2 m_Y m_{Y'}\} , \qquad (14)$$

for some global $C'' > 0$. Now notice that the number of possibilities to choose such a 4-tuple of sets is
bounded above by $\exp\{C''' m_Y \log n\}$, for some global $C''' > 0$. Assuming

for some C, and applying a union bound over all possible combinations Y, Y' , Z, Z' of sizes my , my ,mz,mz' 



respectively, of which there are at most exp{C°my log n} for some C° > 0, we conclude that ( 12 ) is violated 
for some combination with probability at most 

exp{-C"(p - qfmYmY'/2} (16) 

which is at most exp{— 201ogn} if 

for some C" > 0. Apply a union bound now over the possible combinations of the tuple (my, my, mz, mz') , 



of which there are at most exp{4 log n} to conclude that ( 12 ) holds uniformly for all possibilities of Y, Y' , Z, Z' 
with probability at least 1 — n^^. 

Now assume by contradiction that $\phi$ is not an injection, so $\phi(i) = \phi(i') =: j$ for some distinct $i, i' \in [r]$.
Set $Y = U_i \cap V_j$, $Y' = U_{i'} \cap V_j$, $Z = U_i \setminus Y$, $Z' = U_{i'} \setminus Y'$. Note that $\max\{|Z|, |Z'|\} \le m$ and $\min\{|Y|, |Y'|\} \ge M - m$.
Consider the solution $(K', B')$ where $K'$ is obtained from $K$ by replacing the two blocks corresponding
to $U_i, U_{i'}$ with three blocks: $Y \cup Y'$, $Z$, $Z'$. Inequality (12) guarantees that the cost of $(K', B')$ is strictly lower
than that of $(K, B)$, contradicting optimality of the latter. (Note that $\|K\|_* = \|K'\|_*$.)

We can now also conclude that $r \le k$. Fix $i \in [r]$. We show that not too many elements of $V_{\phi(i)}$ can be
contained in $V \setminus (U_1 \cup \dots \cup U_r)$. We need the following assumption.

Assumption 15. For all pairwise disjoint sets Y,X,Z C V such that \Y\ > M — m, \X\ > m, {Y\JX) C Vj 
for some j £ [k], \Z\ <m, ZnVj = 0.- 

1^1 • \Y\dx,Y + (jfy^,- - \Y\ ■ \Z\dY,z > 

■ \Y\ + f - |r| . \Z\) + . (18) 
The assumption holds with probabihty at least 1 — n ^. To see why, first notice that |X|/(ci + C2) < 
q)\X\ ■ \ Y\ by (jsj), as long as C2 is large enough. This implies that the RHS of (18) is upper bounded 

by 

p-^hp-q)) \X\ ■ \Y\ + -^iC^J) \Y\ ■ \Z\) (19) 



ci +cH 2 



Proving that the LHS of ([18]) (denoted f{X, Y, Z)) is larger than ([19]) (denoted g{X, Y, Z)) uniformly w.h.p. 
can now be easily done as follows. By fixing my — \Y\^mx = the number of combinations for Y^X, Z 
is at most exp{C(my + mx)logn} for some global C > 0. On the other hand, the probability that 
f{X, y, Z) < g{X, Y, Z) for any such option is at most 

exp{-C"(p - qfrnymx} (20) 

for some C > 0. Hence, by union bounding, the probability that some tuple Y,X, Z of sizes m,Y,'mx,'m'Z 
respectively satisfies f{X, Y, Z) < g{X, Y, Z) is at most 

exp{-C"(p-g)2my/2} , (21) 

which is at most exp{— lOlogn} assuming 

M>C{\ogn)/[p-qf , (22) 



for some C > 0. Another union bound over the possible choices of mY,mx-,mz proves that (18) holds 
uniformly with probability at least 1 — n^^. 

Now for some i £ [r] set X :— V0(i) n (V^ \ {Ui U • • • U Ur}) and assume by contradiction that |X| > m. 
Set Y :— V0(i) n Ui and Z — Ui\ V^(i) . Define the solution {K' , B') where K' is obtained from K by replacing 



the block corresponding to Ui in K with two blocks: V0(i) and Ui \ V^(^i). Assumption 15 tells us that the 
cost of {K',B') is strictly lower than that of {K,B). Note that the expression j^ip^ in the RHS of (18) 
accounts for the trace norm difference — ll-^'ll* — \X\. 

We are prepared to perform the final "cleanup" step. At this point we know that for each i £ [r], the 
set Ti = Uid V^fi) satisfies 

m > m-m 

\T^\ > \V,\~rm. 

(The second inequality is implied by the fact that at most m elements of V^f^) may be contained in Ui' for 
i' i, and another at most m elements in V \ {Ui U ■■■ U Ur) ■ We are now going to conclude from this that 
Ui — V0(i) for all i. To that end, let {K' , B') be the feasible solution to (CPl) defined so that K' is a partial 
clustering induced by V^{i), . . . , V^{r)- We would like to argue that \i K ^ K' then the cost of {K',B') is 
strictly smaller than that of {K, B). Fix the value of the collection 

y := ((r,</.(l),...,</.(r), 

{m[ :-|v-^(.)n(y\(c/iU...u;7.)))^^j^,) 

Let I3{y) denote the number oi i ^ j such that rriij > plus the number of z S [r] such that > 0. We 
can assume f3{y) > 0, otherwise Ui = V^j^) for all i € [r] as required. The number of possibilities for K 
and K' giving rise to y is exp{C(^j_^^ rriij + ^•mi)logn} for some C > 0. (Note that K' depends on 
r, (f)(1), . . . , 4>{r) only, while K depends on all elements of y). For each such possibility, the probability that 
the cost of (K, B) is lower than that of {K', B') is at most 

exp{-C"{p - g)'A/(^ m,, + ^ m,)} (23) 
using Hoeffding inequalities, for some C" > 0. (Note that special care needs to be made to account for the 
difference \\K\\^ — \\K'\\^ = X]i=i " this is similar to what we did above .) As long as 

M>C'fk{logn)/{p~qf (24) 

for some & > 0, we conclude that the cost of {K' , B') is at least that of (K, B) for some K giving rise to 
y with probability at most exp{— 10(fclogn)/3(3^)}. The number of combinations of y for a fixed value of 
I3{y) is at most exp{5(fc + I3{y) logn}. By union bounding, we conclude that for fixed /3(3^), the probability 
that some {K,B) has cost at most that of {K',B') is at most exp{— 10(A; log n)/3(3^)}. Finally union bound 
over all possibilities for f3{y), of which there are at most n^. 

Taking Ci, C2 large enough to satisfy the requirements above concludes the proof. 



D Proof of Theorem 9

The proof of Theorem 3 in the previous section made repeated use of Hoeffding tail inequalities, for bounding
the size of the intersection of the noise support with various submatrices. This is tight for $p, q$ which are
bounded away from 0 and 1. However, if $p = \rho p'$, $q = \rho q'$, the noise probabilities $p', q'$ are fixed and $\rho$ tends
to 0, a sharper bound is obtained using the Bernstein tail bound (see Appendix F.2, Lemma 17). Using the Bernstein
inequality instead of the Hoeffding inequality, the expression $(p - q)^2$ in (9), (11), (13), (14), (15), (16), (17), (20), (21),
(22), (23) can be replaced with $\rho$. This clearly gives the required result.
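
The gain from switching tail bounds can be made concrete by comparing the two exponents for a sum of $N$ sparse Bernoulli variables. The sketch below is illustrative: it takes a deviation proportional to $N \rho (p' - q')$, uses $\rho p'$ as an upper bound on the variance, and all numeric values and function names are assumptions.

```python
def hoeffding_exponent(N, t):
    """Exponent in Pr[|S - E S| >= t] <= 2 exp(-2 t^2 / N) for [0, 1] variables."""
    return 2.0 * t * t / N


def bernstein_exponent(N, t, sigma2, B=1.0):
    """Exponent in Pr[|S - E S| > t] <= 2 exp(-(t^2 / 2) / (N sigma^2 + B t / 3))."""
    return (t * t / 2.0) / (N * sigma2 + B * t / 3.0)


def exponents_at_rate(rho, N=10_000, p_prime=0.7, q_prime=0.3, frac=0.25):
    """Both exponents for a deviation t = N * frac * rho * (p' - q')."""
    t = N * frac * rho * (p_prime - q_prime)
    sigma2 = rho * p_prime  # upper bound on the variance rho p' (1 - rho p')
    return hoeffding_exponent(N, t), bernstein_exponent(N, t, sigma2)
```

Halving $\rho$ divides the Hoeffding exponent by four but the Bernstein exponent only by two, which is the quadratic-versus-linear dependence on $\rho$ that the proof exploits.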



E Proof of Lemma [5] 

Proof. We remind the user that g = ^ log^ n, the multiplicative size of the interval Consider the set of 

intervals $(n/(g k_0),\, n/k_0)$, $(n/(g^2 k_0),\, n/(g k_0))$, $\dots$, $(n/(g^{k_0+1} k_0),\, n/(g^{k_0} k_0))$. By the pigeonhole principle, one of these
intervals must not intersect the set of cluster sizes. Assume this interval is $(n/(g^{i_0+1} k_0),\, n/(g^{i_0} k_0))$, for some
< io < kf). Let a — n/g^^^ko. By setting C3{p,q) small enough and Ci{p,q) large enough, one easily 
checks that the requirements of Corollary |4] hold with this value of a and s = n/fco. This concludes the 
proof. □ 



F Technical Lemmas 



F.1 The spectral norm of random matrices

It is well-known that the spectral norm $\lambda_1(A)$ of a zero-mean random matrix $A$ is bounded above w.h.p. by
$C\sqrt{n}$, where $C$ is a constant that might depend on the variance and magnitude of the entries of $A$. Here we
state and (re-)prove an upper bound of $\lambda_1(A)$ with an explicit estimate of the constant $C$, which is needed
in the proof of the main theorem.

Lemma 16. Let $A_{ij}$, $1 \le i, j \le n$, be independent random variables, each of which has mean 0 and variance
at most $\sigma^2$ and is bounded in absolute value by $B$. Then with probability at least $1 - 2n^{-2}$,

$$\lambda_1(A) \le 6 \max\left\{ \sigma \sqrt{n \log n},\; B \log^2 n \right\} .$$

Proof. Let $e_i$ be the $i$-th standard basis vector in $\mathbb{R}^n$. Let $Z_{ij} = A_{ij} e_i e_j^\top$. Then the $Z_{ij}$'s are zero-mean random
matrices independent of each other, and $A = \sum_{i,j} Z_{ij}$. We have $\|Z_{ij}\| \le B$ almost surely. We also have
$\|\sum_{i,j} \mathbb{E}(Z_{ij} Z_{ij}^\top)\| = \|\sum_i e_i e_i^\top \sum_j \mathbb{E}(A_{ij}^2)\| \le n \sigma^2$. Similarly $\|\sum_{i,j} \mathbb{E}(Z_{ij}^\top Z_{ij})\| \le n \sigma^2$. Applying the
Non-commutative Bernstein Inequality (Theorem 1.6 in ?) with $t = 6 \max\{\sigma \sqrt{n \log n}, B \log^2 n\}$ yields the
desired bound. □
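
A quick numerical sanity check of a bound of this shape is sketched below for a matrix of independent random signs, for which $\sigma = B = 1$. The exponent on the logarithm is exposed as the parameter `log_power` and set to 2 by assumption; the function names are ours.

```python
import numpy as np


def spectral_norm_bound(n, sigma, B, log_power=2):
    """6 * max(sigma * sqrt(n log n), B * (log n)^log_power); log_power = 2 is assumed."""
    return 6.0 * max(sigma * np.sqrt(n * np.log(n)), B * np.log(n) ** log_power)


def rademacher_spectral_norm(n, seed=0):
    """Spectral norm of an n x n matrix with i.i.d. +/-1 entries (mean 0, variance 1)."""
    rng = np.random.default_rng(seed)
    A = rng.choice([-1.0, 1.0], size=(n, n))
    return float(np.linalg.norm(A, 2))
```

For such matrices the spectral norm is typically close to $2\sqrt{n}$, comfortably below the bound, which trades tightness for an explicit constant.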



F.2 Standard Bernstein Inequality for Sum of Independent Variables

Lemma 17. (Bernstein inequality) Let $Y_1, \dots, Y_N$ be independent random variables, each of which has
variance bounded by $\sigma^2$ and is bounded in absolute value by $B$ a.s. Then we have that

$$\Pr\left[ \left| \sum_{i=1}^N Y_i - \mathbb{E}\left[ \sum_{i=1}^N Y_i \right] \right| > t \right] \le 2 \exp\left\{ -\frac{t^2/2}{N\sigma^2 + Bt/3} \right\} .$$

The following well known consequence of the theorem will also be of use.

Lemma 18. (?, Proposition 5.16) Let $Y_1, \dots, Y_N$ be independent random variables, each of which has
variance bounded by $\sigma^2$ and is bounded in absolute value by $B$ a.s. Then we have

$$\left| \sum_{i=1}^N Y_i - \mathbb{E}\left[ \sum_{i=1}^N Y_i \right] \right| \le C_0 \max\left\{ \sigma \sqrt{N \log n},\; B \log n \right\}$$

with probability at least $1 - C_1 n^{-C_2}$, where the positive constants $C_0, C_1, C_2$ are independent of $\sigma, B, N$ and
$n$.